WorldWideScience

Sample records for automatic speech recognition

  1. Techniques for automatic speech recognition

    Science.gov (United States)

    Moore, R. K.

    1983-05-01

A brief insight into some of the algorithms that lie behind current automatic speech recognition systems is provided. Early phonetically based approaches were not particularly successful, due mainly to a lack of appreciation of the problems involved. These problems are summarized, and various recognition techniques are reviewed in the context of the solutions that they provide. It is pointed out that the majority of currently available speech recognition equipment employs a 'whole-word' pattern-matching approach which, although relatively simple, has proved particularly successful in its ability to recognize speech. The concept of time normalization plays a central role in this type of recognition process, and a family of such algorithms is described in detail. The technique of dynamic time warping not only provides good performance for isolated-word recognition, but can also be extended to the recognition of connected speech (thereby removing one of the most severe limitations of early speech recognition equipment).
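
As a concrete illustration of the time-normalization idea reviewed above, here is a minimal numpy sketch of dynamic time warping for isolated-word template matching; the function names and the usage comment are illustrative, not taken from the article.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences.

    x: (n, d) and y: (m, d) arrays of frame-level feature vectors.
    Returns the accumulated local distance along the best warping path.
    """
    # Local (frame-to-frame) Euclidean distances.
    local = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    n, m = local.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[n, m]

# Isolated-word recognition: choose the template with the smallest distance.
# best = min(templates, key=lambda w: dtw_distance(test_feats, templates[w]))
```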

  2. Dynamic Automatic Noisy Speech Recognition System (DANSR)

    OpenAIRE

    Paul, Sheuli

    2014-01-01

In this thesis we studied and investigated a very common but long-standing noise problem, and we provide a solution to it. The task is to deal with different types of noise that occur simultaneously, which we call hybrid noise. Although there are individual solutions for specific noise types, one cannot simply combine them, because each solution affects the whole speech signal. We developed an automatic speech recognition system, DANSR (Dynamic Automatic Noisy Speech Recognition System), for hybri...

  3. Automatic speech recognition a deep learning approach

    CERN Document Server

    Yu, Dong

    2015-01-01

This book summarizes recent advances in the field of automatic speech recognition, with a focus on discriminative and hierarchical models. It is the first automatic speech recognition book to include comprehensive coverage of recent developments such as conditional random fields and deep learning techniques. It presents the insights and theoretical foundations of a series of recent models, such as the conditional random field, semi-Markov and hidden conditional random fields, deep neural networks, deep belief networks, and deep stacking models for sequential learning. It also discusses practical considerations of using these models in both acoustic and language modeling for continuous speech recognition.

  4. Development of a System for Automatic Recognition of Speech

    Directory of Open Access Journals (Sweden)

    Michal Kuba

    2003-01-01

Full Text Available The article gives a review of research on the processing and automatic recognition of speech signals (ARR) at the Department of Telecommunications of the Faculty of Electrical Engineering, University of Žilina. On-going research is oriented to speech parametrization using 2-dimensional cepstral analysis, and to the application of HMMs and neural networks for speech recognition in the Slovak language. The article summarizes the results achieved and outlines the future orientation of our research in automatic speech recognition.

  5. Punjabi Automatic Speech Recognition Using HTK

    Directory of Open Access Journals (Sweden)

    Mohit Dua

    2012-07-01

Full Text Available This paper discusses the implementation of an isolated-word Automatic Speech Recognition (ASR) system for the Indian regional language Punjabi. The HTK toolkit, based on the Hidden Markov Model (HMM), a statistical approach, is used to develop the system. Initially the system is trained for 115 distinct Punjabi words by collecting data from eight speakers, and it is then tested using samples from six speakers in real-time environments. To make the system more interactive and fast, a GUI has been developed on the Java platform for implementing the testing module. The paper also describes the role of each HTK tool used in the various phases of system development, by presenting a detailed architecture of an ASR system developed using HTK library modules and tools. The experimental results show that the overall system performance is 95.63% and 94.08%.

  6. A New Structure for Automatic Speech Recognition

    Science.gov (United States)

    Duchnowski, Paul

Speech is a wideband signal, with cues identifying a particular element distributed across frequency. To capture these cues, most ASR systems analyze the speech signal into spectral (or spectrally derived) components prior to recognition. Traditionally, these components are integrated across frequency to form a vector of "acoustic evidence" on which a decision by the ASR system is based. This thesis develops an alternate approach, post-labeling integration. In this scheme, tentative decisions, or labels, of the identity of a given speech element are assigned in parallel by sub-recognizers, each operating on a band-limited portion of the speech waveform. Outputs of these independent channels are subsequently combined (integrated) to render the final decision. Remarkably good recognition of band-limited nonsense syllables by humans motivates this method. It also allows potentially more accurate parameterization of the speech waveform and, simultaneously, robust estimation of parameter probabilities. The algorithm also represents an attempt to make explicit use of redundancies in speech. Three basic methods of parameterizing the band-limited input of the sub-recognizers were considered, focusing respectively on LPC coefficients, cepstrum coefficients, and parameters based on the autocorrelation function. Four sub-recognizers were implemented as discrete Hidden Markov Model (HMM) systems. A Maximum A Posteriori (MAP) hypothesis-testing approach was applied to the problem of integrating the individual sub-recognizer decisions on a frame-by-frame basis. Final segmentation was achieved by a secondary HMM. Five methods of estimating the probabilities necessary for MAP integration were tested. The proposed structure was applied to the task of phonetic, speaker-independent, continuous speech recognition. Performance for several combinations of parameterization schemes and integration methods was measured. The best score of 58.5% on a 39-phone alphabet is roughly
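
The post-labeling integration step can be made concrete with a small sketch. Under the assumption that the band-limited observations are conditionally independent given the label, per-band posteriors combine as the prior raised to (1-B) times the product of the band posteriors. This is one simple MAP combination rule, not necessarily the exact estimator used in the thesis.

```python
import numpy as np

def map_integrate(band_posteriors, prior):
    """Frame-level MAP fusion of per-band label posteriors.

    band_posteriors: (B, K) array, row b = P(label | evidence in band b).
    prior: (K,) label priors. Assuming band observations are conditionally
    independent given the label,
        P(l | e_1..e_B)  is proportional to  prior(l)**(1-B) * prod_b P(l | e_b).
    Computed in the log domain for numerical stability.
    """
    B = band_posteriors.shape[0]
    log_post = (np.log(band_posteriors + 1e-12).sum(axis=0)
                + (1 - B) * np.log(prior + 1e-12))
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

# Per frame: label = np.argmax(map_integrate(band_posteriors, prior))
```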

  7. Confidence Measures for Automatic and Interactive Speech Recognition

    OpenAIRE

    Sánchez Cortina, Isaías

    2016-01-01

[EN] This thesis contributes to the field of Automatic Speech Recognition (ASR), and particularly to Interactive Speech Transcription (IST) and Confidence Measures (CM) for ASR. The main goals of this thesis can be summarised as follows: 1. To design IST methods and tools to tackle the problem of improving automatically generated transcripts. 2. To assess the designed IST methods and tools on real-life transcription tasks in large educational repositories of vide...

  8. Speaker-Machine Interaction in Automatic Speech Recognition. Technical Report.

    Science.gov (United States)

    Makhoul, John I.

The feasibility and limitations of speaker adaptation in improving the performance of a "fixed" (speaker-independent) automatic speech recognition system were examined. A fixed vocabulary of 55 syllables, containing 11 stops and fricatives and five tense vowels, is used in the recognition system. The results of an experiment on speaker…

  9. Automatic Emotion Recognition in Speech: Possibilities and Significance

    Directory of Open Access Journals (Sweden)

    Milana Bojanić

    2009-12-01

Full Text Available Automatic speech recognition and spoken language understanding are crucial steps towards natural human-machine interaction. The main task in the speech communication process is the recognition of the word sequence, but the recognition of prosody, emotion and stress tags may be of particular importance as well. This paper discusses the possibilities of recognizing emotion from the speech signal in order to improve ASR, and also provides an analysis of acoustic features that can be used for the detection of the speaker's emotion and stress. The paper also provides a short overview of emotion and stress classification techniques. The importance and place of emotional speech recognition is shown in the domain of human-computer interactive systems and the transaction communication model. Directions for future work are given at the end of this work.

  10. Automatic Phonetic Transcription for Danish Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

to acquire and expensive to create. For languages with productive compounding or agglutinative languages like German and Finnish, respectively, phonetic dictionaries are also hard to maintain. For this reason, automatic phonetic transcription tools have been produced for many languages. The quality...... of automatic phonetic transcriptions varies greatly with respect to language and transcription strategy. For some languages, where the difference between the graphemic and phonetic representations is small, graphemic transcriptions can be used to create ASR systems with acceptable performance. In other languages......, syllabication, stød and several other suprasegmental features (Kirkedal, 2013). Simplifying the transcriptions by filtering out the symbols for suprasegmental features in a post-processing step produces a format that is suitable for ASR purposes. eSpeak is an open source speech synthesizer originally created...

  11. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Science.gov (United States)

    Caballero Morales, Santiago Omar; Cox, Stephen J.

    2009-12-01

    Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of "metamodels" that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.
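
A toy sketch of the metamodel idea: score candidate lexicon entries by how likely the speaker's known confusions would turn them into the recognized phone string, then weight by a language model. All names here are illustrative, and the equal-length alignment is a simplification the real system does not make.

```python
import numpy as np

def word_likelihood(recognized, intended, confusion, phone_index):
    """P(recognized phone string | intended word), via the speaker's
    confusion matrix: confusion[i, j] = P(heard phone j | spoken phone i).

    Simplification: recognized and intended are assumed equal-length and
    pre-aligned; the actual metamodels also handle insertions/deletions.
    """
    p = 1.0
    for spoken, heard in zip(intended, recognized):
        p *= confusion[phone_index[spoken], phone_index[heard]]
    return p

def correct_word(recognized, lexicon, confusion, phone_index, lm_prob):
    """Pick the lexicon word maximizing confusion likelihood * LM prior.

    lexicon: dict word -> phone sequence; lm_prob: word -> probability.
    """
    return max(lexicon, key=lambda w: word_likelihood(
        recognized, lexicon[w], confusion, phone_index) * lm_prob(w))
```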

  12. Automatic speech recognition for radiological reporting

    International Nuclear Information System (INIS)

Large-vocabulary speech recognition, its techniques and its software and hardware technology, are being developed with the aim of providing the office user with a tool that could significantly improve both the quantity and quality of his work: the dictation machine, which allows memos and documents to be input using voice and a microphone instead of fingers and a keyboard. The IBM Rome Science Center, together with the IBM Research Division, has built a prototype recognizer that accepts sentences in natural language from a 20,000-word Italian vocabulary. The unit runs on a personal computer equipped with special hardware capable of providing all the necessary computing power. The first laboratory experiments yielded very interesting results and pointed out system characteristics that make its use possible in operational environments. To this purpose, the dictation of medical reports was considered a suitable application. In cooperation with the 2nd Radiology Department of S. Maria della Misericordia Hospital (Udine, Italy), the system was tried by radiology department doctors during their everyday work. The doctors were able to dictate their reports directly to the unit. The text appeared immediately on the screen, and any errors could be corrected either by voice or by using the keyboard. At the end of report dictation, the doctors could both print and archive the text. The report could also be forwarded to the hospital information system, where the latter was available. Our results have been very encouraging: the system proved to be robust, simple to use, and accurate (over 95% average recognition rate). The experiment yielded valuable suggestions and comments, and its results are useful for system evolution towards improved system management and efficiency.

  13. Mixed Bayesian Networks with Auxiliary Variables for Automatic Speech Recognition

    OpenAIRE

    Stephenson, Todd Andrew; Magimai.-Doss, Mathew; Bourlard, Hervé

    2001-01-01

    Standard hidden Markov models (HMMs), as used in automatic speech recognition (ASR), calculate their emission probabilities by an artificial neural network (ANN) or a Gaussian distribution conditioned on the hidden state variable, considering the emissions independent of any other variable in the model. Recent work showed the benefit of conditioning the emission distributions on a discrete auxiliary variable, which is observed in training and hidden in recognition. Related work has shown the ...

  14. Robust Automatic Speech Recognition in Impulsive Noise Environment

    Institute of Scientific and Technical Information of China (English)

    DING Pei; CAO Zhigang

    2005-01-01

This paper presents an efficient method to directly suppress the effect of impulsive noise for robust automatic speech recognition (ASR). In this method, according to the noise sensitivity of each feature dimension, the observation vectors are divided into several parts, each of which is assigned a proper threshold. In the recognition stage, the unreliable probability preponderance of an incorrect competing path caused by impulsive noise is eliminated by flooring the observation probability (FOP) of each feature sub-vector at the Gaussian-mixture level, so that the correct path recovers its priority of being chosen in decoding. Experimental results demonstrate that the proposed method can significantly improve recognition accuracy both in machine-gun noise and in simulated impulsive noise environments, while maintaining high performance for clean speech recognition.
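
A minimal sketch of the flooring idea, assuming per-sub-vector log-likelihoods are already available from the Gaussian mixture evaluation; the thresholds would be set per sub-vector according to noise sensitivity, as described above.

```python
import numpy as np

def floored_log_likelihood(sub_log_likes, floors):
    """Sum per-sub-vector Gaussian log-likelihoods after flooring each one.

    sub_log_likes: (S,) log-likelihoods of the S feature sub-vectors under
    one mixture component; floors: (S,) thresholds, chosen per sub-vector
    according to how sensitive its dimensions are to impulsive noise.
    A frame hit by an impulse yields extremely low sub-vector scores;
    flooring keeps them from vetoing the correct path during decoding.
    """
    return float(np.sum(np.maximum(sub_log_likes, floors)))
```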

  15. Speech recognition for embedded automatic positioner for laparoscope

    Science.gov (United States)

    Chen, Xiaodong; Yin, Qingyun; Wang, Yi; Yu, Daoyin

    2014-07-01

In this paper a novel speech recognition methodology based on the Hidden Markov Model (HMM) is proposed for an embedded Automatic Positioner for Laparoscope (APL), built around a fixed-point ARM processor as its core. The APL system is designed to assist the doctor in laparoscopic surgery by implementing the specific doctor's vocal control of the laparoscope. Real-time response to voice commands calls for an efficient speech recognition algorithm in the APL. In order to reduce computation cost without significant loss in recognition accuracy, both arithmetic and algorithmic optimizations are applied in the presented method. First, relying mostly on arithmetic optimizations, a fixed-point front end for speech feature analysis is built to match the ARM processor's characteristics. Then a fast likelihood computation algorithm is used to reduce the computational complexity of the HMM-based recognition algorithm. The experimental results show that the method keeps the recognition time under 0.5 s with accuracy higher than 99%, demonstrating its ability to achieve real-time vocal control of the APL.

  16. Automatic speech recognition for report generation in computed tomography

    International Nuclear Information System (INIS)

Purpose: A study was performed to compare the performance of automatic speech recognition (ASR) with conventional transcription. Materials and Methods: 100 CT reports were generated by using ASR and 100 CT reports were dictated and written by medical transcriptionists. The time for dictation and correction of errors by the radiologist was assessed and the types of mistakes were analysed. The text recognition rate was calculated in both groups, and the average time between completion of the imaging study by the technologist and generation of the written report was assessed. A commercially available speech recognition technology (ASKA Software, IBM Via Voice) running on a personal computer was used. Results: The time for dictation using digital voice recognition was 9.4±2.3 min compared to 4.5±3.6 min with an ordinary Dictaphone. The text recognition rate was 97% with digital voice recognition and 99% with medical transcriptionists. The average time from imaging completion to written report finalisation was reduced from 47.3 hours with medical transcriptionists to 12.7 hours with ASR. The analysis of misspellings demonstrated (ASR vs. medical transcriptionists): 3 vs. 4 syntax errors, 0 vs. 37 orthographic mistakes, 16 vs. 22 mistakes in substance, and 47 vs. erroneously applied terms. Conclusions: The use of digital voice recognition as a replacement for medical transcription is recommended when immediate availability of written reports is necessary. (orig.)

  17. Modelling Errors in Automatic Speech Recognition for Dysarthric Speakers

    Directory of Open Access Journals (Sweden)

    Santiago Omar Caballero Morales

    2009-01-01

Full Text Available Dysarthria is a motor speech disorder characterized by weakness, paralysis, or poor coordination of the muscles responsible for speech. Although automatic speech recognition (ASR) systems have been developed for disordered speech, factors such as low intelligibility and limited phonemic repertoire decrease speech recognition accuracy, making conventional speaker adaptation algorithms perform poorly on dysarthric speakers. In this work, rather than adapting the acoustic models, we model the errors made by the speaker and attempt to correct them. For this task, two techniques have been developed: (1) a set of “metamodels” that incorporate a model of the speaker's phonetic confusion matrix into the ASR process; (2) a cascade of weighted finite-state transducers at the confusion matrix, word, and language levels. Both techniques attempt to correct the errors made at the phonetic level and make use of a language model to find the best estimate of the correct word sequence. Our experiments show that both techniques outperform standard adaptation techniques.

  18. On Automatic Voice Casting for Expressive Speech: Speaker Recognition vs. Speech Classification

    OpenAIRE

    Obin, Nicolas; Roebel, Axel; Bachman, Grégoire

    2014-01-01

    This paper presents the first large-scale automatic voice casting system, and explores the adaptation of speaker recognition techniques to measure voice similarities. The proposed system is based on the representation of a voice by classes (e.g., age/gender, voice quality, emotion). First, a multi-label system is used to classify speech into classes. Then, the output probabilities for each class are concatenated to form a vector that represents the vocal signature of a speech recording. Final...

  19. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    OpenAIRE

    Kang, S.; Kim, J.-H.; Seo, J.

    2014-01-01

Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypothesis of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech reco...

  20. Speech recognition and understanding

    Energy Technology Data Exchange (ETDEWEB)

    Vintsyuk, T.K.

    1983-05-01

This article discusses the automatic processing of speech signals with the aim of finding a sequence of words (speech recognition) or a concept (speech understanding) being transmitted by the speech signal. The goal of the research is to develop an automatic typewriter that will automatically edit and type text under voice control. A dynamic programming method is proposed in which reference signals for all possible classes are stored, after which the presented signal is compared to all the stored signals during the recognition phase. Topics considered include element-by-element recognition of words of speech, learning speech recognition, phoneme-by-phoneme speech recognition, the recognition of connected speech, understanding connected speech, and prospects for designing speech recognition and understanding systems. An application of the composition dynamic programming method to the solution of basic problems in the recognition and understanding of speech is presented.

  1. Automatic Speech Recognition Systems for the Evaluation of Voice and Speech Disorders in Head and Neck Cancer

    Directory of Open Access Journals (Sweden)

    Andreas Maier

    2010-01-01

Full Text Available In patients suffering from head and neck cancer, speech intelligibility is often restricted. For assessment and outcome measurement, automatic speech recognition systems have previously been shown to be appropriate for objective and quick evaluation of intelligibility. In this study we investigate the applicability of the method to speech disorders caused by head and neck cancer. Intelligibility was quantified by speech recognition on recordings of a standard text read by 41 German laryngectomized patients with cancer of the larynx or hypopharynx and 49 German patients who had suffered from oral cancer. The speech recognizer provides the percentage of correctly recognized words of a sequence, that is, the word recognition rate. Automatic evaluation was compared to perceptual ratings by a panel of experts and to an age-matched control group. Both patient groups showed significantly lower word recognition rates than the control group. Automatic speech recognition yielded word recognition rates which agreed with the experts' evaluation of intelligibility at a significant level. Automatic speech recognition thus serves as a good, low-effort means to objectify and quantify the most important aspect of pathologic speech, its intelligibility. The system was successfully applied to voice and speech disorders.

  2. Speech Acquisition and Automatic Speech Recognition for Integrated Spacesuit Audio Systems

    Science.gov (United States)

    Huang, Yiteng; Chen, Jingdong; Chen, Shaoyan

    2010-01-01

A voice-command human-machine interface system has been developed for spacesuit extravehicular activity (EVA) missions. A multichannel acoustic signal processing method has been created for distant speech acquisition in noisy and reverberant environments. This technology reduces noise by exploiting differences in the statistical nature of signal (i.e., speech) and noise in the spatial and temporal domains. As a result, the automatic speech recognition (ASR) accuracy can be improved to the level at which crewmembers would find the speech interface useful. The developed speech human/machine interface will enable both crewmember usability and operational efficiency. It offers a fast rate of data/text entry, a small overall size, and light weight. In addition, this design frees the hands and eyes of a suited crewmember. The system components and steps include beamforming/multi-channel noise reduction, single-channel noise reduction, speech feature extraction, feature transformation and normalization, feature compression, model adaptation, ASR HMM (Hidden Markov Model) training, and ASR decoding. A state-of-the-art phoneme recognizer can obtain an accuracy rate of 65 percent when the training and testing data are free of noise. When it is used in spacesuits, the rate drops to about 33 percent. With the developed microphone-array speech-processing technologies, the performance is improved and the phoneme recognition accuracy rate rises to 44 percent. The recognizer can be further improved by combining the microphone array and HMM model adaptation techniques and using speech samples collected from inside spacesuits. In addition, arithmetic complexity models for the major HMM-based ASR components were developed. They can help real-time ASR system designers select proper tasks in the face of constraints on computational resources.
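
The first processing step, beamforming/multi-channel noise reduction, can be illustrated with the simplest member of that family, a delay-and-sum beamformer; this numpy sketch is generic and not NASA's actual implementation.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamformer for an M-microphone array.

    channels: (M, N) array of microphone signals; delays: (M,) integer
    sample delays steering the beam toward the talker (estimated, e.g.,
    from cross-correlation between channel pairs). Coherent speech adds
    up across channels while diffuse noise averages down.
    """
    out = np.zeros(channels.shape[1])
    for sig, d in zip(channels, delays):
        out += np.roll(sig, -d)  # np.roll wraps at the edges; fine for a sketch
    return out / len(channels)
```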

  3. Studies on inter-speaker variability in speech and its application in automatic speech recognition

    Indian Academy of Sciences (India)

    S Umesh

    2011-10-01

    In this paper, we give an overview of the problem of inter-speaker variability and its study in many diverse areas of speech signal processing. We first give an overview of vowel-normalization studies that minimize variations in the acoustic representation of vowel realizations by different speakers. We then describe the universal-warping approach to speaker normalization which unifies many of the vowel normalization approaches and also shows the relation between speech production, perception and auditory processing. We then address the problem of inter-speaker variability in automatic speech recognition (ASR) and describe techniques that are used to reduce these effects and thereby improve the performance of speaker-independent ASR systems.

  4. Studies in automatic speech recognition and its application in aerospace

    Science.gov (United States)

    Taylor, Michael Robinson

    Human communication is characterized in terms of the spectral and temporal dimensions of speech waveforms. Electronic speech recognition strategies based on Dynamic Time Warping and Markov Model algorithms are described and typical digit recognition error rates are tabulated. The application of Direct Voice Input (DVI) as an interface between man and machine is explored within the context of civil and military aerospace programmes. Sources of physical and emotional stress affecting speech production within military high performance aircraft are identified. Experimental results are reported which quantify fundamental frequency and coarse temporal dimensions of male speech as a function of the vibration, linear acceleration and noise levels typical of aerospace environments; preliminary indications of acoustic phonetic variability reported by other researchers are summarized. Connected whole-word pattern recognition error rates are presented for digits spoken under controlled Gz sinusoidal whole-body vibration. Correlations are made between significant increases in recognition error rate and resonance of the abdomen-thorax and head subsystems of the body. The phenomenon of vibrato style speech produced under low frequency whole-body Gz vibration is also examined. Interactive DVI system architectures and avionic data bus integration concepts are outlined together with design procedures for the efficient development of pilot-vehicle command and control protocols.

  5. Man-system interface based on automatic speech recognition: integration to a virtual control desk

    International Nuclear Information System (INIS)

This work reports the implementation of a man-system interface based on automatic speech recognition and its integration into a virtual nuclear power plant control desk. The latter aims to reproduce a real control desk using virtual reality technology, for operator training and ergonomic evaluation purposes. An automatic speech recognition system was developed to serve as a new interface with users, replacing the computer keyboard and mouse. Users can operate this virtual control desk in front of a computer monitor or a projection screen through spoken commands. The automatic speech recognition interface developed is based on a well-known signal processing technique named cepstral analysis, and on artificial neural networks. The speech recognition interface is described, along with its integration with the virtual control desk, and results are presented. (author)
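
The cepstral-analysis front end mentioned above can be sketched in a few lines of numpy: the real cepstrum is the inverse FFT of the log magnitude spectrum, and its low-quefrency coefficients would be the inputs to the neural network. The window choice and coefficient count are illustrative, not taken from the paper.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Low-quefrency real cepstrum of one speech frame.

    The cepstrum is the inverse FFT of the log magnitude spectrum; its
    first coefficients describe the vocal-tract envelope and would serve
    as inputs to the neural-network classifier.
    """
    windowed = frame * np.hamming(len(frame))
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-10)
    return np.fft.irfft(log_spectrum)[:n_coeffs]
```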

  6. Developing and Evaluating an Oral Skills Training Website Supported by Automatic Speech Recognition Technology

    Science.gov (United States)

    Chen, Howard Hao-Jan

    2011-01-01

    Oral communication ability has become increasingly important to many EFL students. Several commercial software programs based on automatic speech recognition (ASR) technologies are available but their prices are not affordable for many students. This paper will demonstrate how the Microsoft Speech Application Software Development Kit (SASDK), a…

  7. Difficulties in Automatic Speech Recognition of Dysarthric Speakers and Implications for Speech-Based Applications Used by the Elderly: A Literature Review

    Science.gov (United States)

    Young, Victoria; Mihailidis, Alex

    2010-01-01

    Despite their growing presence in home computer applications and various telephony services, commercial automatic speech recognition technologies are still not easily employed by everyone; especially individuals with speech disorders. In addition, relatively little research has been conducted on automatic speech recognition performance with older…

  8. Post-error Correction in Automatic Speech Recognition Using Discourse Information

    Directory of Open Access Journals (Sweden)

    KANG, S.

    2014-05-01

Full Text Available Overcoming speech recognition errors in the field of human-computer interaction is important in ensuring a consistent user experience. This paper proposes a semantic-oriented post-processing approach for the correction of errors in speech recognition. The novelty of the model proposed here is that it re-ranks the n-best hypotheses of speech recognition based on the user's intention, which is analyzed from previous discourse information, while conventional automatic speech recognition systems focus only on acoustic and language model scores for the current sentence. The proposed model successfully reduces the word error rate and semantic error rate by 3.65% and 8.61%, respectively.
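
The re-ranking step can be sketched generically: interpolate the recognizer's score with a discourse-based intention score and pick the best hypothesis. The linear interpolation below is a common formulation, not necessarily the paper's exact one.

```python
def rerank_nbest(hypotheses, discourse_score, alpha=0.8):
    """Re-rank an n-best list with a discourse-based intention score.

    hypotheses: list of (sentence, asr_log_score) pairs, the ASR score
    combining acoustic and language model terms as usual.
    discourse_score: callable giving a log score for how well a sentence
    matches the intention inferred from previous turns (model-specific).
    alpha: interpolation weight, to be tuned on held-out dialogues.
    """
    return max(hypotheses,
               key=lambda h: alpha * h[1] + (1.0 - alpha) * discourse_score(h[0]))
```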

  9. Assessing the efficacy of benchmarks for automatic speech accent recognition

    Directory of Open Access Journals (Sweden)

    Benjamin Bock

    2015-08-01

Full Text Available Speech accents can possess valuable information about the speaker and can be used in intelligent multimedia-based human-computer interfaces. The performance of algorithms for automatic classification of accents is often evaluated using audio datasets that include recorded samples of different people representing different accents. Here we describe a method that can detect bias in accent datasets, and apply it to two accent identification datasets to reveal the existence of dataset bias, meaning that the datasets can be classified with accuracy higher than random even if the tested algorithm has no ability to analyze speech accent. We used the datasets by separating one second of silence from the beginning of each audio sample, such that the one-second sample did not contain voice, and therefore no information about the accent. An audio classification method was then applied to the datasets of silent audio samples, and provided classification accuracy significantly higher than random. These results indicate that the performance of accent classification algorithms measured using some accent classification benchmarks can be biased, and can be driven by differences in the background noise rather than the auditory features of the accents.
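
A rough sketch of the bias test, assuming the leading second of each clip has already been cut out; the band-energy features and random-forest classifier are stand-ins for whatever audio classifier one prefers, not the authors' exact method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def silence_bias_score(silence_clips, accent_labels):
    """Cross-validated accuracy of accent 'classification' from silence.

    silence_clips: list of equal-length arrays cut from before speech
    onset, so they carry channel/background information but no accent.
    A mean accuracy well above 1/n_classes indicates dataset bias.
    """
    feats = []
    for clip in silence_clips:
        power = np.abs(np.fft.rfft(clip)) ** 2
        # Crude channel signature: log energy in 16 fixed frequency bands.
        feats.append([np.log(b.sum() + 1e-12)
                      for b in np.array_split(power, 16)])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, np.array(feats), accent_labels, cv=5).mean()
```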

  10. Evaluating Automatic Speech Recognition-Based Language Learning Systems: A Case Study

    Science.gov (United States)

    van Doremalen, Joost; Boves, Lou; Colpaert, Jozef; Cucchiarini, Catia; Strik, Helmer

    2016-01-01

    The purpose of this research was to evaluate a prototype of an automatic speech recognition (ASR)-based language learning system that provides feedback on different aspects of speaking performance (pronunciation, morphology and syntax) to students of Dutch as a second language. We carried out usability reviews, expert reviews and user tests to…

  11. Fusing Eye-gaze and Speech Recognition for Tracking in an Automatic Reading Tutor

    DEFF Research Database (Denmark)

    Rasmussen, Morten Højfeldt; Tan, Zheng-Hua

    2013-01-01

    In this paper we present a novel approach for automatically tracking the reading progress using a combination of eye-gaze tracking and speech recognition. The two are fused by first generating word probabilities based on eye-gaze information and then using these probabilities to augment the...

  12. Machine learning based sample extraction for automatic speech recognition using dialectal Assamese speech.

    Science.gov (United States)

    Agarwalla, Swapna; Sarma, Kandarpa Kumar

    2016-06-01

Automatic Speaker Recognition (ASR) and related issues are continuously evolving as inseparable elements of Human-Computer Interaction (HCI). With the assimilation of emerging concepts like big data and the Internet of Things (IoT) as extended elements of HCI, ASR techniques are passing through a paradigm shift. Of late, learning-based techniques have started to receive greater attention from research communities related to ASR, owing to the fact that the former possess a natural ability to mimic biological behavior and thereby aid ASR modeling and processing. The current learning-based ASR techniques are evolving further with the incorporation of big data and IoT-like concepts. Here, in this paper, we report certain approaches based on machine learning (ML) used for the extraction of relevant samples from a big data space, and apply them to ASR using certain soft computing techniques for Assamese speech with dialectal variations. A class of ML techniques comprising the basic Artificial Neural Network (ANN) in feedforward (FF) and Deep Neural Network (DNN) forms, using raw speech, extracted features and frequency-domain forms, is considered. The Multi Layer Perceptron (MLP) is configured with inputs in several forms to learn class information obtained using clustering and manual labeling. DNNs are also used to extract specific sentence types. Initially, from a large storage, relevant samples are selected and assimilated. Next, a few conventional methods are used for feature extraction of a few selected types. The features comprise both spectral and prosodic types. These are applied to Recurrent Neural Network (RNN) and Fully Focused Time Delay Neural Network (FFTDNN) structures to evaluate their performance in recognizing mood, dialect, speaker and gender variations in dialectal Assamese speech. The system is tested under several background noise conditions by considering the recognition rates (obtained using confusion matrices and manually) and computation time

  13. Assessment of Severe Apnoea through Voice Analysis, Automatic Speech, and Speaker Recognition Techniques

    Science.gov (United States)

    Fernández Pozo, Rubén; Blanco Murillo, Jose Luis; Hernández Gómez, Luis; López Gonzalo, Eduardo; Alcázar Ramírez, José; Toledano, Doroteo T.

    2009-12-01

    This study is part of an ongoing collaborative effort between the medical and the signal processing communities to promote research on applying standard Automatic Speech Recognition (ASR) techniques for the automatic diagnosis of patients with severe obstructive sleep apnoea (OSA). Early detection of severe apnoea cases is important so that patients can receive early treatment. Effective ASR-based detection could dramatically cut medical testing time. Working with a carefully designed speech database of healthy and apnoea subjects, we describe an acoustic search for distinctive apnoea voice characteristics. We also study abnormal nasalization in OSA patients by modelling vowels in nasal and nonnasal phonetic contexts using Gaussian Mixture Model (GMM) pattern recognition on speech spectra. Finally, we present experimental findings regarding the discriminative power of GMMs applied to severe apnoea detection. We have achieved an 81% correct classification rate, which is very promising and underpins the interest in this line of inquiry.
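
The GMM classification experiment can be sketched with scikit-learn: fit one mixture per class on frame-level features and label a test speaker by average log-likelihood. Component counts and feature choice are placeholders, not the study's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_detector(frames_healthy, frames_apnoea, n_components=16):
    """Fit one GMM per class on frame-level spectral features.

    frames_*: (n_frames, n_features) matrices pooled over the speakers
    of the training partition.
    """
    g_h = GaussianMixture(n_components, covariance_type="diag").fit(frames_healthy)
    g_a = GaussianMixture(n_components, covariance_type="diag").fit(frames_apnoea)
    return g_h, g_a

def classify_speaker(test_frames, g_h, g_a):
    """Label a speaker by the class whose GMM gives the higher average
    log-likelihood over all of that speaker's frames (GaussianMixture.score
    returns the per-sample mean log-likelihood)."""
    return "apnoea" if g_a.score(test_frames) > g_h.score(test_frames) else "healthy"
```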

  14. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    OpenAIRE

    Ibrahim Missaoui; Zied Lachiri

    2016-01-01

In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that ...

  15. Correcting Automatic Speech Recognition Errors in Real Time

    OpenAIRE

    Wald, M; Boulain, P; Bell, J.; Doody, K; Gerrard, J

    2007-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure deaf students are not disadvantaged and assist all learners to search fo...

  16. Automatic speech recognition (zero crossing method). Automatic recognition of isolated vowels

    International Nuclear Information System (INIS)

This note describes a method for the recognition of isolated vowels, using preprocessing of the vocal signal. The processing extracts the extrema of the vocal signal and the time intervals separating them (zero-crossing distances of the first derivative of the signal). The recognition of vowels uses normalized histograms of the values of these intervals. The program determines a distance between the histogram of the sound to be recognized and histogram models built during a learning phase. The results, processed in real time by a minicomputer, are relatively independent of the speaker, provided the fundamental frequency does not vary too much (i.e. among speakers of the same sex). (author)
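
A small numpy sketch of the described pipeline: locate extrema as sign changes of the differenced signal, histogram the inter-extrema intervals, and classify by nearest histogram template. The bin count and L1 distance are illustrative choices, not taken from the note.

```python
import numpy as np

def extrema_interval_histogram(signal, n_bins=40, max_interval=200):
    """Normalized histogram of sample intervals between signal extrema.

    Extrema sit where the first derivative changes sign, i.e. at zero
    crossings of the differenced signal.
    """
    d = np.diff(signal)
    extrema = np.where(np.sign(d[:-1]) != np.sign(d[1:]))[0]
    intervals = np.diff(extrema)
    hist, _ = np.histogram(intervals, bins=n_bins, range=(0, max_interval))
    return hist / max(hist.sum(), 1)

def recognize_vowel(signal, templates):
    """Nearest-template classification against per-vowel histogram models
    built in a learning phase (templates: vowel -> histogram array)."""
    h = extrema_interval_histogram(signal)
    return min(templates, key=lambda v: np.abs(h - templates[v]).sum())
```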

  17. Physiologically Motivated Feature Extraction for Robust Automatic Speech Recognition

    Directory of Open Access Journals (Sweden)

    Ibrahim Missaoui

    2016-04-01

Full Text Available In this paper, a new method is presented to extract robust speech features in the presence of external noise. The proposed method, based on two-dimensional Gabor filters, takes into account the spectro-temporal modulation frequencies and also limits the redundancy at the feature level. The performance of the proposed feature extraction method was evaluated on isolated speech words extracted from the TIMIT corpus and corrupted by background noise. The evaluation results demonstrate that the proposed feature extraction method outperforms classic methods such as Perceptual Linear Prediction, Linear Predictive Coding, Linear Prediction Cepstral coefficients and Mel Frequency Cepstral Coefficients.

  18. Noise robust automatic speech recognition with adaptive quantile based noise estimation and speech band emphasizing filter bank

    DEFF Research Database (Denmark)

    Bonde, Casper Stork; Graversen, Carina; Gregersen, Andreas Gregers;

    2005-01-01

An important topic in Automatic Speech Recognition (ASR) is to reduce the effect of noise, in particular when a mismatch exists between the training and application conditions. Many noise robustness schemes within the feature processing domain use as a prerequisite a noise estimate prior to the ... appearance of the speech signal, which requires noise-robust voice activity detection and assumptions of stationary noise. However, both of these requirements are often not met, and it is therefore of particular interest to investigate methods like the Quantile Based Noise Estimation (QBNE) method, which ... estimates the noise during speech and non-speech sections without the use of a voice activity detector. While the standard QBNE method uses a fixed pre-defined quantile across all frequency bands, this paper suggests adaptive QBNE (AQBNE), which adapts the quantile individually to each frequency band...
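
The core of QBNE fits in a few lines: take a low quantile of each frequency band's power trajectory as the noise estimate. The adaptive variant is shown only schematically, since the band-wise quantile adaptation rule is specific to the paper.

```python
import numpy as np

def qbne(power_spec, q=0.5):
    """Quantile-based noise estimate from a power spectrogram.

    power_spec: (n_bands, n_frames). In every band, speech is absent or
    weak in a large fraction of frames, so a low-to-middle quantile over
    time tracks the noise floor without any voice activity detector.
    """
    return np.quantile(power_spec, q, axis=1)

def adaptive_qbne(power_spec, band_quantiles):
    """AQBNE: one quantile per frequency band. How the per-band quantiles
    are adapted is specific to the paper; here they are simply inputs."""
    return np.array([np.quantile(band, qb)
                     for band, qb in zip(power_spec, band_quantiles)])
```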

  19. Arabic Language Learning Assisted by Computer, based on Automatic Speech Recognition

    CERN Document Server

    Terbeh, Naim

    2012-01-01

This work consists of creating a Computer Assisted Language Learning (CALL) system based on an Automatic Speech Recognition (ASR) system for the Arabic language, using the CMU Sphinx3 tool [1] and the HMM approach. For this work, we constructed a corpus of six hours of speech recordings from nine speakers. We find in the robustness to noise grounds for the choice of the HMM approach [2]. The results achieved are encouraging, given that our corpus contains only nine speakers, and they open the door to further improvement work.

  20. Suprasegmental Duration Modelling with Elastic Constraints in Automatic Speech Recognition

    OpenAIRE

    Molloy, Laurence; Isard, Stephen

    1998-01-01

    In this paper a method of integrating a model of suprasegmental duration with a HMM-based recogniser at the post-processing level is presented. The N-Best utterance output is rescored using a suitable linear combination of acoustic log-likelihood (provided by a set of tied-state triphone HMMs) and duration log-likelihood (provided by a set of durational models). The durational model used in the post-processing imposes syllable-level elastic constraints on the durational behaviour of speech se...

  1. A HYBRID METHOD FOR AUTOMATIC SPEECH RECOGNITION PERFORMANCE IMPROVEMENT IN REAL WORLD NOISY ENVIRONMENT

    Directory of Open Access Journals (Sweden)

    Urmila Shrawankar

    2013-01-01

Full Text Available It is a well-known fact that speech recognition systems perform well when the system is used in conditions similar to those used to train the acoustic models. However, mismatches degrade performance. In an adverse environment, it is very difficult to predict the category of noise in advance in the case of real-world environmental noise, and difficult to achieve environmental robustness. After a rigorous experimental study it was observed that no single method is available that cleans noisy speech while preserving the quality of speech corrupted by real, natural environmental (mixed) noise. It was also observed that back-end techniques alone are not sufficient to improve the performance of a speech recognition system. It is necessary to implement performance improvement techniques at every step of the back-end as well as the front-end of the Automatic Speech Recognition (ASR) model. Current recognition systems address this problem using a technique called adaptation. This study presents an experimental study with two aims. The first is to implement a hybrid method that cleans the speech signal as much as possible with all combinations of filters and enhancement techniques. The second is to develop a method for training on all categories of noise that can adapt the acoustic models to a new environment, which helps to improve the performance of the speech recognizer under real-world environmental mismatched conditions. This experiment confirms that hybrid adaptation methods improve ASR performance on both levels, Signal-to-Noise Ratio (SNR) improvement as well as word recognition accuracy, in real-world noisy environments.

  2. Analysis of Phonetic Transcriptions for Danish Automatic Speech Recognition

    DEFF Research Database (Denmark)

    Kirkedal, Andreas Søeborg

recognition system depends heavily on the dictionary and the transcriptions therein. This paper presents an analysis of phonetic/phonemic features that are salient for current Danish ASR systems. This preliminary study consists of a series of experiments using an ASR system trained on the DK-PAROLE corpus.... The analysis indicates that transcribing e.g. stress or vowel duration has a negative impact on performance. The best performance is obtained with coarse phonetic annotation, improving word error rate by 1% and sentence error rate by 3.8%.

  3. Contribution to automatic speech recognition. Analysis of the direct acoustical signal. Recognition of isolated words and phoneme identification

    International Nuclear Information System (INIS)

This report deals with the acoustical-phonetic step of automatic speech recognition. The parameters used are the extrema of the acoustical signal (coded in amplitude and duration). This coding method, whose properties are described, is simple and well adapted to digital processing. The quality and intelligibility of the coded signal after reconstruction are particularly satisfactory. An experiment in the automatic recognition of isolated words has been carried out using this coding system. We have designed a filtering algorithm operating on the parameters of the coding. Thus the characteristics of the formants can be derived under certain conditions, which are discussed. Using these characteristics, the identification of a large part of the phonemes for a given speaker was achieved. Carrying on the studies required the development of a particular real-time processing methodology which allowed immediate evaluation of program improvements. Such processing on temporal coding of the acoustical signal is extremely powerful and could, used in connection with other methods, represent an efficient tool for the automatic processing of speech. (author)

  4. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Umit H. Yapanel

    2008-08-01

Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.

  5. Towards an Intelligent Acoustic Front End for Automatic Speech Recognition: Built-in Speaker Normalization

    Directory of Open Access Journals (Sweden)

    Yapanel UmitH

    2008-01-01

Full Text Available A proven method for achieving effective automatic speech recognition (ASR) despite speaker differences is to perform acoustic feature speaker normalization. More effective speaker normalization methods are needed which require limited computing resources for real-time performance. The most popular speaker normalization technique is vocal-tract length normalization (VTLN), despite the fact that it is computationally expensive. In this study, we propose a novel online VTLN algorithm entitled built-in speaker normalization (BISN), where normalization is performed on-the-fly within a newly proposed PMVDR acoustic front end. The novel aspect of the algorithm is that conventional front-end processing with PMVDR and VTLN needs two separate warping phases, while the proposed BISN method uses a single speaker-dependent warp to achieve both the PMVDR perceptual warp and the VTLN warp simultaneously. This improved integration unifies the nonlinear warping performed in the front end and reduces computational requirements, thereby offering advantages for real-time ASR systems. Evaluations are performed for (i) an in-car extended digit recognition task, where an on-the-fly BISN implementation reduces the relative word error rate (WER) by 24%, and (ii) a diverse noisy speech task (SPINE 2), where the relative WER improvement was 9%, both relative to the baseline speaker normalization method.
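
For reference, here is a sketch of the kind of piecewise-linear VTLN frequency warp that BISN folds into the front end; the knee constant and search range are typical textbook values, not parameters taken from this paper.

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyquist=8000.0, knee=0.875):
    """Piecewise-linear VTLN frequency warp (an HTK-style formulation).

    freqs: frequencies in Hz (e.g., mel filter centers); alpha: the
    speaker-dependent warp factor, typically searched over roughly
    0.88..1.12 by maximizing utterance likelihood. The slope changes at
    a knee frequency so the warp maps the full band onto itself.
    """
    f_knee = knee * min(1.0, 1.0 / alpha) * f_nyquist
    freqs = np.asarray(freqs, dtype=float)
    upper_slope = (f_nyquist - alpha * f_knee) / (f_nyquist - f_knee)
    return np.where(freqs <= f_knee,
                    alpha * freqs,
                    alpha * f_knee + upper_slope * (freqs - f_knee))
```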

  6. Automatic Speech Recognition Using Template Model for Man-Machine Interface

    OpenAIRE

    Mishra, Neema; Shrawankar, Urmila; Thakare, V. M

    2013-01-01

Speech is a natural form of communication for human beings, and computers with the ability to understand speech and speak with a human voice are expected to contribute to the development of more natural man-machine interfaces. Computers with this kind of ability are gradually becoming a reality through the evolution of speech recognition technologies. Speech is an important mode of interaction with computers. In this paper feature extraction is implemented using the well-known Mel-Frequenc...

  7. An exploration of the potential of Automatic Speech Recognition to assist and enable receptive communication in higher education

    Directory of Open Access Journals (Sweden)

    Mike Wald

    2006-12-01

Full Text Available The potential use of Automatic Speech Recognition to assist receptive communication is explored. The opportunities and challenges that this technology presents to students and staff are discussed and evaluated: providing captioning of speech online or in classrooms for deaf or hard-of-hearing students, and assisting blind, visually impaired or dyslexic learners to read and search learning material more readily by augmenting synthetic speech with naturally recorded real speech. The automatic provision of online lecture notes, synchronised with speech, enables staff and students to focus on learning and teaching issues, while also benefiting learners unable to attend the lecture or who find it difficult or impossible to take notes at the same time as listening, watching and thinking.

  8. Call recognition and individual identification of fish vocalizations based on automatic speech recognition: An example with the Lusitanian toadfish.

    Science.gov (United States)

    Vieira, Manuel; Fonseca, Paulo J; Amorim, M Clara P; Teixeira, Carlos J C

    2015-12-01

The study of acoustic communication in animals often requires not only the recognition of species-specific acoustic signals but also the identification of individual subjects, all in a complex acoustic background. Moreover, when very long recordings are to be analyzed, automatic recognition and identification processes are invaluable tools to extract the relevant biological information. A pattern recognition methodology based on hidden Markov models is presented, inspired by the successful results obtained with the most widely known and complex acoustic communication signal: human speech. This methodology was applied here for the first time to the detection and recognition of fish acoustic signals, specifically in a stream of round-the-clock recordings of Lusitanian toadfish (Halobatrachus didactylus) in their natural estuarine habitat. The results show that this methodology is able not only to detect the mating sounds (boatwhistles) but also to identify individual male toadfish, reaching an identification rate of ca. 95%. Moreover, this method also proved to be a powerful tool for assessing signal durations in large data sets. However, the system failed to recognize other sound types. PMID:26723348
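
A sketch of the general recipe: train one Gaussian HMM per sound type or individual and pick the best-scoring model for each detected segment, exactly as in speaker identification. The hmmlearn package, feature choice, state count and iteration count are all assumptions, not details from the paper.

```python
import numpy as np
from hmmlearn import hmm  # assumed toolkit; the paper's is unspecified

def train_call_model(feature_seqs, n_states=5):
    """Fit one Gaussian HMM per sound type or individual.

    feature_seqs: list of (n_frames, n_features) arrays, one per example
    call (e.g., cepstral features of one male's boatwhistles).
    """
    X = np.vstack(feature_seqs)
    lengths = [len(s) for s in feature_seqs]
    model = hmm.GaussianHMM(n_components=n_states,
                            covariance_type="diag", n_iter=20)
    model.fit(X, lengths)
    return model

def identify(segment_features, models):
    """Assign a detected segment to the best-scoring model, mirroring a
    speaker-identification setup (models: name -> trained HMM)."""
    return max(models, key=lambda name: models[name].score(segment_features))
```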

  9. Robust Automatic Speech Recognition Features using Complex Wavelet Packet Transform Coefficients

    Directory of Open Access Journals (Sweden)

    Tjong Wan Sen

    2009-11-01

Full Text Available To improve the performance of phoneme-based Automatic Speech Recognition (ASR) in noisy environments, we developed a new technique that adds robustness to clean phoneme features. These robust features are obtained from Complex Wavelet Packet Transform (CWPT) coefficients. Since the CWPT coefficients represent all the different frequency bands of the input signal, decomposing the input signal into a complete CWPT tree also covers all frequencies involved in the recognition process. For time-overlapping signals with different frequency contents, e.g. a phoneme signal with noise, the CWPT coefficients are the combination of the CWPT coefficients of the phoneme signal and the CWPT coefficients of the noise. The CWPT coefficients of the phoneme signal change according to the frequency components contained in the noise. Since the number of phonemes in every language is relatively small (limited) and already well known, one can easily derive principal component vectors from a clean training dataset using Principal Component Analysis (PCA). These principal component vectors can then be used to add robustness and minimize noise effects in the testing phase. Simulation results, using Alpha Numeric 4 (AN4) from Carnegie Mellon University and NOISEX-92 examples from Rice University, showed that this new technique can be used as a feature extractor that improves the robustness of phoneme-based ASR systems in various adverse noisy conditions while preserving performance in clean environments.
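
The PCA step can be sketched with scikit-learn: learn the clean-feature subspace once, then project noisy CWPT feature vectors onto it and back to suppress off-subspace noise energy. The component count is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_clean_subspace(clean_features, n_components=32):
    """Learn principal components from clean training feature vectors
    (e.g., CWPT coefficient vectors of clean phonemes)."""
    return PCA(n_components=n_components).fit(clean_features)

def denoise(noisy_features, pca):
    """Project noisy vectors onto the clean subspace and back; variance
    outside the clean principal components, largely noise, is discarded."""
    return pca.inverse_transform(pca.transform(noisy_features))
```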

  10. Pattern recognition in speech and language processing

    CERN Document Server

    Chou, Wu

    2003-01-01

Contents include: Minimum Classification Error (MCE) Approach in Pattern Recognition (Wu Chou); Minimum Bayes-Risk Methods in Automatic Speech Recognition (Vaibhava Goel and William Byrne); A Decision Theoretic Formulation for Adaptive and Robust Automatic Speech Recognition (Qiang Huo); Speech Pattern Recognition Using Neural Networks (Shigeru Katagiri); Large Vocabulary Speech Recognition Based on Statistical Methods (Jean-Luc Gauvain); Toward Spontaneous Speech Recognition and Understanding (Sadaoki Furui); Speaker Authentication (Qi Li and Biing-Hwang Juang); HMMs for Language Processing Problems (Ri...

  11. Morpho-syntactic post-processing of N-best lists for improved French automatic speech recognition

    OpenAIRE

    Huet, Stéphane; Gravier, Guillaume; Sébillot, Pascale

    2010-01-01

Many automatic speech recognition (ASR) systems rely solely on pronunciation dictionaries and language models to take into account information about language. Implicitly, morphology and syntax are to a certain extent embedded in the language models, but the richness of such linguistic knowledge is not exploited. This paper studies the use of morpho-syntactic (MS) information in a post-processing stage of an ASR system, by reordering N-best lists. Each sentence hypothesis ...

  12. Automatic Speaker Recognition System

    Directory of Open Access Journals (Sweden)

    Parul, R. B. Dubey

    2012-12-01

Full Text Available Spoken language is used by humans to convey many types of information. Primarily, speech conveys messages via words. Owing to advanced speech technologies, people's interactions with remote machines, such as phone banking, internet browsing, and secure information retrieval by voice, are becoming popular today. Speaker verification and speaker identification are important for authentication and verification in security applications. Speaker identification methods can be divided into text-independent and text-dependent. Speaker recognition is the process of automatically recognizing a speaker's voice on the basis of individual information included in the input speech waves. It consists of comparing a speech signal from an unknown speaker to a set of stored data of known speakers. This process recognizes who has spoken by matching the input signal with pre-stored samples. The work is focused on improving the performance of speaker verification under noisy conditions.

  13. Dynamic time warping applied to detection of confusable word pairs in automatic speech recognition

    OpenAIRE

    Anguita Ortega, Jan; Hernando Pericás, Francisco Javier

    2005-01-01

    In this paper we present a method to predict whether two words are likely to be confused by an Automatic Speech Recognition (ASR) system. This method is based on the classical Dynamic Time Warping (DTW) technique. This technique, which is usually used in ASR to measure the distance between two speech signals, is used here to calculate the distance between two words. With this distance the words are classified as confusable or not confusable using a threshold. We have te...

  14. Estimation of phoneme-specific HMM topologies for the automatic recognition of dysarthric speech.

    Science.gov (United States)

    Caballero-Morales, Santiago-Omar

    2013-01-01

    Dysarthria is a frequently occurring motor speech disorder which can be caused by neurological trauma, cerebral palsy, or degenerative neurological diseases. Because dysarthria affects phonation, articulation, and prosody, spoken communication of dysarthric speakers gets seriously restricted, affecting their quality of life and confidence. Assistive technology has led to the development of speech applications to improve the spoken communication of dysarthric speakers. In this field, this paper presents an approach to improve the accuracy of HMM-based speech recognition systems. Because phonatory dysfunction is a main characteristic of dysarthric speech, the phonemes of a dysarthric speaker are affected at different levels. Thus, the approach consists of finding the most suitable type of HMM topology (Bakis, Ergodic) for each phoneme in the speaker's phonetic repertoire. The topology is further refined with a suitable number of states and Gaussian mixture components for acoustic modelling. This represents a difference when compared with studies where a single topology is assumed for all phonemes. Finding the suitable parameters (topology and mixture components) is performed with a Genetic Algorithm (GA). Experiments with a well-known dysarthric speech database showed statistically significant improvements of the proposed approach when compared with the single topology approach, even for speakers with severe dysarthria. PMID:24222784
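
    A compact sketch of the genetic-algorithm search described above is given below. The genome is (topology, number of states, number of mixture components); the fitness function here is a synthetic placeholder, whereas in the paper each candidate would require training and testing an HMM for the phoneme in question. All names are illustrative, not the authors' code.

```python
# GA sketch for per-phoneme HMM topology selection. `fitness` is a
# stand-in for per-phoneme recognition accuracy (here it simply prefers
# 5-state Bakis models with 4 mixtures, purely for demonstration).
import random

TOPOLOGIES = ["bakis", "ergodic"]

def random_genome():
    return (random.choice(TOPOLOGIES), random.randint(3, 7), random.randint(1, 8))

def fitness(genome):
    topo, states, mixes = genome
    return -(abs(states - 5) + abs(mixes - 4)) - (0 if topo == "bakis" else 1)

def mutate(genome):
    topo, states, mixes = genome
    choice = random.randrange(3)
    if choice == 0:
        topo = random.choice(TOPOLOGIES)
    elif choice == 1:
        states = max(2, states + random.choice([-1, 1]))
    else:
        mixes = max(1, mixes + random.choice([-1, 1]))
    return (topo, states, mixes)

def evolve(generations=30, pop_size=20, elite=5):
    pop = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:elite]
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - elite)]
    return max(pop, key=fitness)

print(evolve())  # best (topology, states, mixtures) found for one phoneme
```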

  15. Context dependent speech recognition

    OpenAIRE

    Andersson, Sebastian

    2006-01-01

    Poor speech recognition is a problem when developing spoken dialogue systems, but several studies have shown that speech recognition can be improved by post-processing of recognition output that uses the dialogue context, acoustic properties of a user utterance and other available resources to train a statistical model to use as a filter between the speech recogniser and the dialogue manager. In this thesis a corpus of logged interactions between users and a dialogue system was used...

  16. Speech Recognition on Mobile Devices

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    The enthusiasm of deploying automatic speech recognition (ASR) on mobile devices is driven both by remarkable advances in ASR technology and by the demand for efficient user interfaces on such devices as mobile phones and personal digital assistants (PDAs). This chapter presents an overview of ASR...

  17. Novel Techniques for Dialectal Arabic Speech Recognition

    CERN Document Server

    Elmahdy, Mohamed; Minker, Wolfgang

    2012-01-01

    Novel Techniques for Dialectal Arabic Speech Recognition describes approaches to improve automatic speech recognition for dialectal Arabic. Since speech resources for dialectal Arabic speech recognition are very sparse, the authors describe how existing Modern Standard Arabic (MSA) speech data can be applied to dialectal Arabic speech recognition, while assuming that MSA is always a second language for all Arabic speakers. In this book, Egyptian Colloquial Arabic (ECA) has been chosen as a typical Arabic dialect. ECA is the first ranked Arabic dialect in terms of number of speakers, and a high quality ECA speech corpus with accurate phonetic transcription has been collected. MSA acoustic models were trained using news broadcast speech. In order to cross-lingually use MSA in dialectal Arabic speech recognition, the authors have normalized the phoneme sets for MSA and ECA. After this normalization, they have applied state-of-the-art acoustic model adaptation techniques like Maximum Likelihood Linear Regression (MLLR) and M...

  18. Speech recognition from spectral dynamics

    Indian Academy of Sciences (India)

    Hynek Hermansky

    2011-10-01

    Information is carried in changes of a signal. The paper starts with revisiting Dudley’s concept of the carrier nature of speech. It points to its close connection to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of gradual infusion of the modulation spectrum concept into Automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a historical global overview of attempts for using spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
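
    Since the record names RASTA filtering, a brief sketch may help: each log filter-bank trajectory is band-pass filtered over time so that slowly varying channel effects (and very fast fluctuations) are suppressed. The IIR coefficients below are the ones commonly quoted for RASTA; the input is a synthetic log-spectrogram, not real speech.

```python
# RelAtive SpecTrAl (RASTA) filtering sketch: band-pass each log
# filter-bank trajectory along time. Filter H(z) = 0.1*(2 + z^-1 - z^-3
# - 2z^-4)/(1 - 0.98 z^-1), the form usually cited for RASTA.
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spec):
    """Apply the standard RASTA band-pass filter along the time axis."""
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])  # FIR (derivative-like) part
    a = np.array([1.0, -0.98])                 # leaky integrator
    return lfilter(b, a, log_spec, axis=0)

rng = np.random.default_rng(0)
# 500 frames x 20 bands; the constant +3.0 models a fixed channel offset.
log_spec = rng.standard_normal((500, 20)) + 3.0
filtered = rasta_filter(log_spec)
# Zero DC gain: once the filter settles, the channel offset is removed.
print(abs(filtered[-50:].mean()))  # far below the original offset of 3.0
```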

  19. Advances in Speech Recognition

    CERN Document Server

    Neustein, Amy

    2010-01-01

    This volume comprises contributions from eminent leaders in the speech industry, and presents a comprehensive and in-depth analysis of the progress of speech technology in the topical areas of mobile settings, healthcare and call centers. The material addresses the technical aspects of voice technology within the framework of societal needs, such as the use of speech recognition software to produce up-to-date electronic health records, notwithstanding patients making changes to health plans and physicians. Included will be discussion of speech engineering, linguistics, human factors ana

  20. Speech recognition based on pattern recognition techniques

    Science.gov (United States)

    Rabiner, Lawrence R.

    1990-05-01

    Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic phonetic approaches. To date, the greatest degree of success in speech recognition has been obtained using pattern recognition paradigms. The use of pattern recognition techniques was applied to the problems of isolated word (or discrete utterance) recognition, connected word recognition, and continuous speech recognition. It is shown that understanding (and consequently the resulting recognizer performance) is best for the simplest recognition tasks and is considerably less well developed for large-scale recognition systems.

  1. HUMAN SPEECH EMOTION RECOGNITION

    Directory of Open Access Journals (Sweden)

    Maheshwari Selvaraj

    2016-02-01

    Full Text Available Emotions play an extremely important role in human mental life. They are a medium of expression of one’s perspective or one’s mental state to others. Speech Emotion Recognition (SER) can be defined as the extraction of the emotional state of the speaker from his or her speech signal. There are a few universal emotions, including Neutral, Anger, Happiness and Sadness, which any intelligent system with finite computational resources can be trained to identify or synthesize as required. In this work spectral and prosodic features are used for speech emotion recognition because both of these features contain the emotional information. Mel-frequency cepstral coefficients (MFCC) are one of the spectral features. Fundamental frequency, loudness, pitch, speech intensity and glottal parameters are the prosodic features used to model different emotions. The potential features are extracted from each utterance for the computational mapping between emotions and speech patterns. Pitch can be detected from the selected features, and used to classify gender. A Support Vector Machine (SVM) is used to classify gender in this work. Radial Basis Function and Back Propagation networks are used to recognize the emotions based on the selected features, and it is shown that the radial basis function produces more accurate results for emotion recognition than the back propagation network.
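
    A pipeline sketch of the SVM gender-classification step mentioned above follows. Synthetic two-class feature vectors stand in for the MFCC and prosodic features (pitch, intensity, loudness) that would be extracted from real utterances; nothing here reproduces the paper's actual data or model.

```python
# Sketch: SVM gender classifier on stand-in feature vectors.
# Hypothetical 14-dim vectors: e.g. 13 MFCC means plus mean pitch.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
male = rng.normal(loc=0.0, scale=1.0, size=(100, 14))
female = rng.normal(loc=1.0, scale=1.0, size=(100, 14))
X = np.vstack([male, female])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)       # RBF kernel, as is typical
print("gender accuracy:", clf.score(X_te, y_te))
```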

  2. Robust speech recognition using articulatory information

    OpenAIRE

    Kirchhoff, Katrin

    1999-01-01

    Current automatic speech recognition systems make use of a single source of information about their input, viz. a preprocessed form of the acoustic speech signal, which encodes the time-frequency distribution of signal energy. The goal of this thesis is to investigate the benefits of integrating articulatory information into state-of-the art speech recognizers, either as a genuine alternative to standard acoustic representations, or as an additional source of information. Articulatory informa...

  3. Creating Accessible Educational Multimedia through Editing Automatic Speech Recognition Captioning in Real Time

    OpenAIRE

    Wald, M

    2006-01-01

    Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Notetakers can only summarise what is being said while qualified sign language interpreters with a good understanding of the re...

  4. Captioning for Deaf and Hard of Hearing People by Editing Automatic Speech Recognition in Real Time

    OpenAIRE

    Wald, M

    2006-01-01

    Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes when lip-reading or watching a sign-language interpreter. Notetakers summarise what is being said while qualified sign language interpreters with a good understanding of the relevant higher education subject content are in very scarce supply. Real time captioning/transcription is not normally available in UK higher education because of the shortage of real time stenographers. Lectures...

  5. The Phase Spectra Based Feature for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Abbasian ALI

    2009-07-01

    Full Text Available Speech recognition in adverse environments is one of the major issues in automatic speech recognition nowadays. While most current speech recognition systems are highly efficient in ideal environments, their performance degrades drastically when they are applied in real environments because of noise-affected speech. In this paper a new feature representation based on phase spectra and Perceptual Linear Prediction (PLP) is suggested, which can be used for robust speech recognition. It is shown that these new features can improve speech recognition performance not only in clean conditions but also at various noise levels, when compared to PLP features.

  6. Optimizing Automatic Speech Recognition for Low-Proficient Non-Native Speakers

    Directory of Open Access Journals (Sweden)

    Catia Cucchiarini

    2010-01-01

    Full Text Available Computer-Assisted Language Learning (CALL) applications for improving the oral skills of low-proficient learners have to cope with non-native speech that is particularly challenging. Since unconstrained non-native ASR is still problematic, a possible solution is to elicit constrained responses from the learners. In this paper, we describe experiments aimed at selecting utterances from lists of responses. The first experiment on utterance selection indicates that the decoding process can be improved by optimizing the language model and the acoustic models, thus reducing the utterance error rate from 29–26% to 10–8%. Since giving feedback on incorrectly recognized utterances is confusing, we verify the correctness of the utterance before providing feedback. The results of the second experiment on utterance verification indicate that combining duration-related features with a likelihood ratio (LR) yields an equal error rate (EER) of 10.3%, which is significantly better than the EER for the other measures in isolation.
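
    For readers unfamiliar with the equal error rate used above: it is the operating point at which the false-accept and false-reject rates coincide. A small sketch follows, with synthetic verification scores standing in for the paper's combined duration/likelihood-ratio measure.

```python
# EER sketch: sweep a decision threshold over verification scores and
# find where false-accept rate (FAR) equals false-reject rate (FRR).
import numpy as np

def eer(genuine_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, best_eer = np.inf, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # wrongly accepted
        frr = np.mean(genuine_scores < t)     # wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

rng = np.random.default_rng(2)
genuine = rng.normal(2.0, 1.0, 500)   # scores for correct utterances
impostor = rng.normal(0.0, 1.0, 500)  # scores for misrecognized utterances
print("EER ~", eer(genuine, impostor))
```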

  7. Speech recognition in university classrooms

    OpenAIRE

    Wald, Mike; Bain, Keith; Basson, Sara H

    2002-01-01

    The LIBERATED LEARNING PROJECT (LLP) is an applied research project studying two core questions: 1) Can speech recognition (SR) technology successfully digitize lectures to display spoken words as text in university classrooms? 2) Can speech recognition technology be used successfully as an alternative to traditional classroom notetaking for persons with disabilities? This paper addresses these intriguing questions and explores the underlying complex relationship between speech recognition te...

  8. Machines a Comprendre la Parole: Methodologie et Bilan de Recherche (Automatic Speech Recognition: Methodology and the State of the Research)

    Science.gov (United States)

    Haton, Jean-Pierre

    1974-01-01

    Still no decisive result has been achieved in the automatic machine recognition of sentences of a natural language. Current research concentrates on developing algorithms for syntactic and semantic analysis. It is obvious that clues from all levels of perception have to be taken into account if a long term solution is ever to be found. (Author/MSE)

  9. Phonetic Alphabet for Speech Recognition of Czech

    OpenAIRE

    J. Uhlir; Psutka, J.; J. Nouza

    1997-01-01

    In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech language. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Cze...

  10. Robust coarticulatory modeling for continuous speech recognition

    Science.gov (United States)

    Schwartz, R.; Chow, Y. L.; Dunham, M. O.; Kimball, O.; Krasner, M.; Kubala, F.; Makhoul, J.; Price, P.; Roucos, S.

    1986-10-01

    The purpose of this project is to perform research into algorithms for the automatic recognition of individual sounds or phonemes in continuous speech. The algorithms developed should be appropriate for understanding large-vocabulary continuous speech input and are to be made available to the Strategic Computing Program for incorporation in a complete word recognition system. This report describes progress to date in developing phonetic models that are appropriate for continuous speech recognition. In continuous speech, the acoustic realization of each phoneme depends heavily on the preceding and following phonemes: a process known as coarticulation. Thus, while there are relatively few phonemes in English (on the order of fifty or so), the number of possible different acoustic realizations is in the thousands. Therefore, to develop high-accuracy recognition algorithms, one may need to develop literally thousands of relatively distinct phonetic models to represent the various phonetic contexts adequately. Developing a large number of models usually necessitates having a large amount of speech to provide reliable estimates of the model parameters. The major contributions of this work are the development of: (1) A simple but powerful formalism for modeling phonemes in context; (2) Robust training methods for the reliable estimation of model parameters by utilizing the available speech training data in a maximally effective way; and (3) Efficient search strategies for phonetic recognition while maintaining high recognition accuracy.

  11. The Use of Speech Recognition Technology in Automotive Applications

    OpenAIRE

    Gellatly, Andrew William

    1997-01-01

    The research objectives were (1) to perform a detailed review of the literature on speech recognition technology and the attentional demands of driving; (2) to develop decision tools that assist designers of in-vehicle systems; (3) to experimentally examine automatic speech recognition (ASR) design parameters, input modalities, and driver ages; and (4) to provide human factors recommendations for the use of speech recognition technology in automotive applicatio...

  12. Speech Recognition Technology: Applications & Future

    OpenAIRE

    Pankaj Pathak

    2010-01-01

    Voice or speech recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals, and these signals are transformed into coding patterns to which meaning has been assigned". The technology requires a combination of improved artificial intelligence and a more sophisticated speech-recognition engine. A primitive device that could recognize speech was initially developed by AT&T Bell Laboratories in the 1940s. According to...

  13. Multi-thread Parallel Speech Recognition for Mobile Applications

    Directory of Open Access Journals (Sweden)

    LOJKA Martin

    2014-05-01

    Full Text Available In this paper, a server-based solution for a multi-thread large-vocabulary automatic speech recognition engine is described, along with practical application examples for Android OS and HTML5. The basic idea was to make speech recognition available for a full variety of applications for computers and especially for mobile devices. The speech recognition engine should be independent of commercial products and services (where the dictionary cannot be modified). Using third-party services could also be a security and privacy problem in specific applications, when unsecured audio data must not be sent to uncontrolled environments (voice data transferred to servers around the globe). Using our experience with speech recognition applications, we have been able to construct a multi-thread server-based speech recognition solution designed with a simple application programming interface (API) to a speech recognition engine modified to the specific needs of a particular application.

  14. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-09-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extracted prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  15. Assessing the Performance of Automatic Speech Recognition Systems When Used by Native and Non-Native Speakers of Three Major Languages in Dictation Workflows

    DEFF Research Database (Denmark)

    Zapata, Julián; Kirkedal, Andreas Søeborg

    In this paper, we report on a two-part experiment aiming to assess and compare the performance of two types of automatic speech recognition (ASR) systems on two different computational platforms when used to augment dictation workflows. The experiment was performed with a sample of speakers of three major languages and with different linguistic profiles: non-native English speakers; non-native French speakers; and native Spanish speakers. The main objective of this experiment is to examine ASR performance in translation dictation (TD) and medical dictation (MD) workflows without manual transcription vs. with transcription. We discuss the advantages and drawbacks of a particular ASR approach on different computational platforms when used by various speakers of a given language, who may have different accents and levels of proficiency in that language, and who may have different levels of...

  16. Phonetic Alphabet for Speech Recognition of Czech

    Directory of Open Access Journals (Sweden)

    J. Uhlir

    1997-12-01

    Full Text Available In the paper we introduce and discuss an alphabet that has been proposed for phonemically oriented automatic speech recognition. The alphabet, denoted as PAC (Phonetic Alphabet for Czech), consists of 48 basic symbols that allow for distinguishing all major events occurring in spoken Czech language. The symbols can be used both for phonetic transcription of Czech texts as well as for labeling recorded speech signals. For practical reasons, the alphabet occurs in two versions; one utilizes Czech native characters and the other employs symbols similar to those used for English in the DARPA and NIST alphabets.

  17. Post-editing through Speech Recognition

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    In the past couple of years automatic speech recognition (ASR) software has quietly created a niche for itself in many situations of our lives. Nowadays it can be found at the other end of customer-support hotlines, it is built into operating systems and it is offered as an alternative text ... the most popular computer-aided translation workbenches in the market (i.e. MemoQ) together with one of the most well-known ASR packages (i.e. Dragon Naturally Speaking from Nuance). Two data correction modes will be considered: a) keyboard vs. b) keyboard and speech combined. These two different ways...

  18. Speech Recognition for Dental Electronic Health Record

    Czech Academy of Sciences Publication Activity Database

    Nagy, Miroslav; Hanzlíček, Petr; Zvárová, Jana; Dostálová, T.; Seydlová, M.; Hippmann, R.; Smidl, L.; Trmal, J.; Psutka, J.

    Brno: VUTIUM Press, 2008 - (Jan, J.; Kozumplík, J.; Provazník, I.). s. 47-47 ISBN 978-80-214-3612-1. [Biosignal 2008. International EURASIP Conference /19./. 29.06.2008-01.07.2008, Brno] Institutional research plan: CEZ:AV0Z10300504 Keywords : automatic speech recognition * electronic health record * dental medicine Subject RIV: IN - Informatics, Computer Science

  19. Emotion Recognition using Speech Features

    CERN Document Server

    Rao, K Sreenivasa

    2013-01-01

    “Emotion Recognition Using Speech Features” covers emotion-specific features present in speech and discussion of suitable models for capturing emotion-specific information for distinguishing different emotions.  The content of this book is important for designing and developing  natural and sophisticated speech systems. Drs. Rao and Koolagudi lead a discussion of how emotion-specific information is embedded in speech and how to acquire emotion-specific knowledge using appropriate statistical models. Additionally, the authors provide information about using evidence derived from various features and models. The acquired emotion-specific knowledge is useful for synthesizing emotions. Discussion includes global and local prosodic features at syllable, word and phrase levels, helpful for capturing emotion-discriminative information; use of complementary evidences obtained from excitation sources, vocal tract systems and prosodic features in order to enhance the emotion recognition performance;  and pro...

  20. Time-expanded speech and speech recognition in older adults.

    Science.gov (United States)

    Vaughan, Nancy E; Furukawa, Izumi; Balasingam, Nirmala; Mortz, Margaret; Fausti, Stephen A

    2002-01-01

    Speech understanding deficits are common in older adults. In addition to hearing sensitivity, changes in certain cognitive functions may affect speech recognition. One such change that may impact the ability to follow a rapidly changing speech signal is processing speed. When speakers slow the rate of their speech naturally in order to speak clearly, speech recognition is improved. The acoustic characteristics of naturally slowed speech are of interest in developing time-expansion algorithms to improve speech recognition for older listeners. In this study, we tested younger normally hearing, older normally hearing, and older hearing-impaired listeners on time-expanded speech using increased duration and increased intensity of unvoiced consonants. Although all groups performed best on unprocessed speech, performance with processed speech was better with the consonant gain feature without time expansion in the noise condition and better at the slowest time-expanded rate in the quiet condition. The effects of signal processing on speech recognition are discussed. PMID:17642020
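
    As a generic illustration of the time-expansion idea in this record (lengthening speech without changing pitch), the sketch below uses librosa's phase-vocoder time stretch on a toy signal. This is only a stand-in for the study's algorithms, which additionally increased the duration and intensity of unvoiced consonants specifically.

```python
# Time expansion sketch: a stretch rate below 1.0 lengthens the signal
# while preserving pitch. Toy sinusoid stands in for recorded speech.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
speech = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

expanded = librosa.effects.time_stretch(speech, rate=0.8)  # ~25% longer
print(len(speech) / sr, "s ->", len(expanded) / sr, "s")
```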

  1. Lattice Parsing for Speech Recognition

    OpenAIRE

    Chappelier, Jean-Cédric; Rajman, Martin; Aragües, Ramon; Rozenknop, Antoine

    1999-01-01

    A lot of work remains to be done in the domain of a better integration of speech recognition and language processing systems. This paper gives an overview of several strategies for integrating linguistic models into speech understanding systems and investigates several ways of producing sets of hypotheses that include more "semantic" variability than usual language models. The main goal is to present and demonstrate by actual experiments that sequential coupling may be efficiently achieved by w...

  2. Discriminative learning for speech recognition

    CERN Document Server

    He, Xiadong

    2008-01-01

    In this book, we introduce the background and mainstream methods of probabilistic modeling and discriminative parameter optimization for speech recognition. The specific models treated in depth include the widely used exponential-family distributions and the hidden Markov model. A detailed study is presented on unifying the common objective functions for discriminative learning in speech recognition, namely maximum mutual information (MMI), minimum classification error, and minimum phone/word error. The unification is presented, with rigorous mathematical analysis, in a common rational-functio

  3. Automatic Speech Segmentation Based on HMM

    OpenAIRE

    M. Kroul

    2007-01-01

    This contribution deals with the problem of automatic phoneme segmentation using HMMs. Automation of the speech segmentation task is important for applications where a large amount of data needs to be processed, so that manual segmentation is out of the question. In this paper we focus on automatic segmentation of recordings, which will be used to create a triphone synthesis unit database. For speech synthesis, speech unit quality is a crucial aspect, so maximal accuracy in segmentation is ...

  4. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Hiroshi Saruwatari

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved 93.9% word accuracy on a 20 k dictation task for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  5. Unvoiced Speech Recognition Using Tissue-Conductive Acoustic Sensor

    Directory of Open Access Journals (Sweden)

    Heracleous Panikos

    2007-01-01

    Full Text Available We present the use of stethoscope and silicon NAM (nonaudible murmur) microphones in automatic speech recognition. NAM microphones are special acoustic sensors, which are attached behind the talker's ear and can capture not only normal (audible) speech, but also very quietly uttered speech (nonaudible murmur). As a result, NAM microphones can be applied in automatic speech recognition systems when privacy is desired in human-machine communication. Moreover, NAM microphones show robustness against noise, and they might be used in special systems (speech recognition, speech transformation, etc.) for sound-impaired people. Using adaptation techniques and a small amount of training data, we achieved 93.9% word accuracy on a 20 k dictation task for nonaudible murmur recognition in a clean environment. In this paper, we also investigate nonaudible murmur recognition in noisy environments and the effect of the Lombard reflex on nonaudible murmur recognition. We also propose three methods to integrate audible speech and nonaudible murmur recognition using a stethoscope NAM microphone, with very promising results.

  6. Speech recognition: Acoustic, phonetic and lexical knowledge

    Science.gov (United States)

    Zue, V. W.

    1985-08-01

    During this reporting period we continued to make progress on the acquisition of acoustic-phonetic and lexical knowledge. We completed development of a continuous digit recognition system. The system was constructed to investigate the use of acoustic-phonetic knowledge in a speech recognition system. The significant achievements of this study include the development of a soft-failure procedure for lexical access and the discovery of a set of acoustic-phonetic features for verification. We completed a study of the constraints that lexical stress imposes on word recognition. We found that lexical stress information alone can, on the average, reduce the number of word candidates from a large dictionary by more than 80 percent. In conjunction with this study, we successfully developed a system that automatically determines the stress pattern of a word from the acoustic signal. We performed an acoustic study on the characteristics of nasal consonants and nasalized vowels. We have also developed recognition algorithms for nasal murmurs and nasalized vowels in continuous speech. We finished the preliminary development of a system that aligns a speech waveform with the corresponding phonetic transcription.

  7. The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

    NARCIS (Netherlands)

    Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.

    2008-01-01

    OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT i

  8. On speech recognition during anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (informatics) studies the case of the anaesthesia record used during medical operations and the possibility of supplementing it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record ... inaccuracies in the anaesthesia record. Supplementing the electronic anaesthesia record interface with speech input facilities is proposed as one possible solution to a part of the problem. The testing of the various hypotheses has involved the development of a prototype of an electronic anaesthesia record ... accuracy. Finally, the last part of the thesis looks at the acceptance and success of a speech recognition system introduced in a Danish hospital to produce patient records.

  9. Automatic Number Plate Recognition System

    OpenAIRE

    Rajshree Dhruw; Dharmendra Roy

    2014-01-01

    Automatic Number Plate Recognition (ANPR) is a mass surveillance system that captures the image of vehicles and recognizes their license number. The objective is to design an efficient automatic authorized vehicle identification system by using the Indian vehicle number plate. In this paper we discuss different methodologies for number plate localization, character segmentation & recognition of the number plate. The system is mainly applicable to non-standard Indian number plates by recognizing...

  10. Recent Advances in Robust Speech Recognition Technology

    CERN Document Server

    Ramírez, Javier

    2011-01-01

    This E-book is a collection of articles that describe advances in speech recognition technology. Robustness in speech recognition refers to the need to maintain high speech recognition accuracy even when the quality of the input speech is degraded, or when the acoustical, articulatory, or phonetic characteristics of speech in the training and testing environments differ. Obstacles to robust recognition include acoustical degradations produced by additive noise, the effects of linear filtering, nonlinearities in transduction or transmission, as well as impulsive interfering sources, and diminishe

  11. A Research of Speech Emotion Recognition Based on Deep Belief Network and SVM

    Directory of Open Access Journals (Sweden)

    Chenchen Huang

    2014-01-01

    Full Text Available Feature extraction is a very important part of speech emotion recognition, and to address feature extraction in speech emotion recognition problems, this paper proposes a new method of feature extraction, using deep belief networks (DBNs) to extract emotional features from the speech signal automatically. By training a 5-layer deep DBN, speech emotion features are extracted and multiple consecutive frames are incorporated to form a high-dimensional feature. The features after training in the DBNs were the input of a nonlinear SVM classifier, and finally a speech emotion recognition multiple-classifier system was achieved. The speech emotion recognition rate of the system reached 86.5%, which was 7% higher than the original method.

  12. Prediction Method of Speech Recognition Performance Based on HMM-based Speech Synthesis Technique

    Science.gov (United States)

    Terashima, Ryuta; Yoshimura, Takayoshi; Wakita, Toshihiro; Tokuda, Keiichi; Kitamura, Tadashi

    We describe an efficient method that uses an HMM-based speech synthesis technique as a test pattern generator for evaluating the word recognition rate. The recognition rates of each word and speaker can be evaluated from the synthesized speech using this method. The parameter generation technique can be formulated as an algorithm that determines the speech parameter vector sequence O by maximizing P(O|Q,λ) given the model parameter λ and the state sequence Q, under a dynamic acoustic feature constraint. We conducted recognition experiments to illustrate the validity of the method. Approximately 100 speakers were used to train the speaker-dependent models for the speech synthesis used in these experiments, and the synthetic speech was generated as the test patterns for the target speech recognizer. As a result, the recognition rate of the HMM-based synthesized speech shows a good correlation with the recognition rate of the actual speech. Furthermore, we find that our method can predict the speaker recognition rate with approximately 2% error on average. Therefore the evaluation of the speaker recognition rate can be performed automatically by using the proposed method.
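
    The parameter generation step named above has a well-known closed form: with dynamic (delta) features, the full observation is o = Wc for a static sequence c, and maximizing the Gaussian likelihood N(o; m, U) over c gives c = (WᵀU⁻¹W)⁻¹WᵀU⁻¹m. The one-dimensional numpy sketch below illustrates this; the means and variances are synthetic stand-ins for HMM state statistics, not the paper's models.

```python
# Sketch of ML parameter generation under a delta-feature constraint:
# solve c = (W' U^-1 W)^-1 W' U^-1 m for the static trajectory c.
import numpy as np

T = 10  # frames, one static dimension for clarity
# W stacks static and delta rows: delta_t = 0.5 * (c_{t+1} - c_{t-1}).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                       # static coefficient
    if 0 < t < T - 1:
        W[2 * t + 1, t - 1] = -0.5          # delta coefficients
        W[2 * t + 1, t + 1] = 0.5

m = np.zeros(2 * T)
m[0::2] = np.linspace(0.0, 1.0, T)          # synthetic static means
m[1::2] = np.gradient(m[0::2])              # matching delta means
U_inv = np.eye(2 * T)                       # unit variances for simplicity

A = W.T @ U_inv @ W
c = np.linalg.solve(A, W.T @ U_inv @ m)     # smooth ML trajectory
print(np.round(c, 3))
```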

  13. On speech recognition during anaesthesia

    OpenAIRE

    Alapetite, Alexandre

    2007-01-01

    This PhD thesis in human-computer interfaces (HCI, informatics) studies the case of the anaesthesia record used during medical operations and the possibility to supplement it with speech recognition facilities. Problems and limitations have been identified with the traditional paper-based anaesthesia record, but also with newer electronic versions, in particular ergonomic issues and the fact that anaesthesiologists tend to postpone the registration of the medications and other events during b...

  14. Emotion Recognition from Persian Speech with Neural Network

    Directory of Open Access Journals (Sweden)

    Mina Hamidi

    2012-10-01

    Full Text Available In this paper, we report an effort towards automatic recognition of emotional states from continuous Persian speech. Due to the unavailability of an appropriate database in the Persian language for emotion recognition, we first built a database of emotional speech in Persian. This database consists of 2400 wave clips modulated with anger, disgust, fear, sadness, happiness and normal emotions. Then we extracted prosodic features, including features related to the pitch, intensity and global characteristics of the speech signal. Finally, we applied neural networks for automatic recognition of emotion. The resulting average accuracy was about 78%.

  15. Automatic Speech Recognition for Human-Machine Interaction

    OpenAIRE

    Biundo, Giuseppina; Grassi Pauletti, Sara; Ansorge, Michael; Farine, Pierre-André

    2005-01-01

    Since the sixties, movies such as “2001: A Space Odyssey” have familiarized us with the idea of computers that can speak and hear just as a human being does. Automatic speech recognition (ASR) is the technology that allows machines to interpret human speech (i.e. to answer the question: What is being said?). The machine ”speaks back“, either by playing pre-recorded messages or by using text-to-speech (TTS) technology.

  16. Neural Network Based Hausa Language Speech Recognition

    Directory of Open Access Journals (Sweden)

    Matthew K Luka

    2012-05-01

    Full Text Available Speech recognition is a key element of diverse applications in communication systems, medical transcription systems, security systems etc. However, there has been very little research in the domain of speech processing for African languages, hence the need to extend the frontier of research in order to port in the diverse applications based on speech recognition. Hausa language is an important indigenous lingua franca in west and central Africa, spoken as a first or second language by about fifty million people. Speech recognition of the Hausa language is presented in this paper. A pattern recognition neural network was used for developing the system.

  17. Comparative wavelet, PLP, and LPC speech recognition techniques on the Hindi speech digits database

    Science.gov (United States)

    Mishra, A. N.; Shrotriya, M. C.; Sharan, S. N.

    2010-02-01

    In view of the growing use of automatic speech recognition in modern society, we study various alternative representations of the speech signal that have the potential to contribute to the improvement of recognition performance. In this paper wavelet-based features using different wavelets are used for Hindi digit recognition. The recognition performance of these features has been compared with Linear Prediction Coefficients (LPC) and Perceptual Linear Prediction (PLP) features. All features have been tested using a Hidden Markov Model (HMM) based classifier for speaker-independent Hindi digit recognition. The recognition performance of PLP features is 11.3% better than LPC features. The recognition performance with db10 features has shown a further improvement of 12.55% over PLP features. The recognition performance with db10 is the best among all wavelet-based features.
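
    As one plausible reading of the db10 features compared above, the sketch below decomposes a frame with a Daubechies-10 wavelet (via PyWavelets) and uses log subband energies as the feature vector. The exact feature definition in the paper may differ; this only illustrates the general construction.

```python
# Wavelet feature sketch: 'db10' multilevel decomposition, log subband
# energies as features. Synthetic frame stands in for real speech.
import numpy as np
import pywt

def db10_features(frame):
    level = pywt.dwt_max_level(len(frame), pywt.Wavelet("db10").dec_len)
    coeffs = pywt.wavedec(frame, "db10", level=level)
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return np.log(energies + 1e-10)

rng = np.random.default_rng(3)
frame = rng.standard_normal(400)  # one 25 ms frame at 16 kHz
print(db10_features(frame))
```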

  18. Automatic Licenses Plate Recognition

    OpenAIRE

    Ronak P Patel; Narendra M Patel; Keyur Brahmbhatt

    2013-01-01

    This paper describes the Smart Vehicle Screening System, which can be installed into a tollbooth for automated recognition of vehicle license plate information using a photograph of a vehicle. An automated system could then be implemented to control the payment of fees, parking areas, highways, bridges or tunnels, etc. This paper contains a new algorithm for number plate recognition using morphological operations, thresholding, edge detection, and bounding box analysis for number plate extract...

  19. GesRec3D: a real-time coded gesture-to-speech system with automatic segmentation and recognition thresholding using dissimilarity measures

    OpenAIRE

    Craven, Michael P; Curtis, K. Mervyn

    2004-01-01

    A complete microcomputer system is described, GesRec3D, which facilitates the data acquisition, segmentation, learning, and recognition of 3-Dimensional arm gestures, with application as an Augmentative and Alternative Communication (AAC) aid for people with motor and speech disability. The gesture data are acquired from a Polhemus electro-magnetic tracker system, with sensors attached to the finger, wrist and elbow of one arm. Coded gestures are linked to user-defined text, to be spoken by a t...

  20. Automatic pattern recognition

    OpenAIRE

    Petheram, R.J.

    1989-01-01

    In this thesis the author presents a new method for the location, extraction and normalisation of discrete objects found in digital images. The extraction is by means of sub-pixel contour following around the object. The normalisation obtains and removes the information concerning size, orientation and location of the object within an image. Analyses of the results are carried out to determine the confidence in recognition of patterns, and methods of cross correlation of object descriptions ...

  1. Automatic speech signal segmentation based on the innovation adaptive filter

    Directory of Open Access Journals (Sweden)

    Makowski Ryszard

    2014-06-01

    Full Text Available Speech segmentation is an essential stage in designing automatic speech recognition systems and one can find several algorithms proposed in the literature. It is a difficult problem, as speech is immensely variable. The aim of the authors’ studies was to design an algorithm that could be employed at the stage of automatic speech recognition. This would make it possible to avoid some problems related to speech signal parametrization. Posing the problem in such a way requires the algorithm to be capable of working in real time. The only such algorithm was proposed by Tyagi et al. (2006), and it is a modified version of Brandt’s algorithm. The article presents a new algorithm for unsupervised automatic speech signal segmentation. It performs segmentation without access to information about the phonetic content of the utterances, relying exclusively on second-order statistics of a speech signal. The starting point for the proposed method is time-varying Schur coefficients of an innovation adaptive filter. The Schur algorithm is known to be fast, precise, stable and capable of rapidly tracking changes in second-order signal statistics. A transfer from one phoneme to another in the speech signal always indicates a change in signal statistics caused by vocal tract changes. In order to allow for the properties of human hearing, detection of inter-phoneme boundaries is performed based on statistics defined on the mel spectrum determined from the reflection coefficients. The paper presents the structure of the algorithm, defines its properties, lists parameter values, describes detection efficiency results, and compares them with those for another algorithm. The obtained segmentation results are satisfactory.
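
    To illustrate the change-detection principle behind such segmenters, the simplified stand-in below marks candidate boundaries where frame-to-frame spectral change ("spectral flux") peaks. It does not use Schur coefficients or an innovation adaptive filter; it only demonstrates the idea that phoneme transitions show up as jumps in short-term second-order statistics.

```python
# Simplified boundary detector: spectral flux over windowed FFT frames,
# thresholded at mean + k * std. A stand-in for the Schur-coefficient
# tracking described in the record, not the authors' algorithm.
import numpy as np

def spectral_flux_boundaries(signal, frame=256, hop=128, k=2.0):
    frames = [signal[i:i + frame] * np.hanning(frame)
              for i in range(0, len(signal) - frame, hop)]
    spectra = np.abs(np.fft.rfft(np.array(frames), axis=1))
    flux = np.sum(np.diff(spectra, axis=0) ** 2, axis=1)
    threshold = flux.mean() + k * flux.std()
    return np.where(flux > threshold)[0] * hop  # sample indices of changes

# Two synthetic "phonemes": a low-frequency then a high-frequency segment.
t = np.arange(4000) / 8000.0
sig = np.concatenate([np.sin(2 * np.pi * 200 * t),
                      np.sin(2 * np.pi * 1500 * t)])
print(spectral_flux_boundaries(sig))  # expect a boundary near sample 4000
```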

  2. Speech recognition systems on the Cell Broadband Engine

    Energy Technology Data Exchange (ETDEWEB)

    Liu, Y; Jones, H; Vaidya, S; Perrone, M; Tydlitat, B; Nanda, A

    2007-04-20

    In this paper we describe our design, implementation, and first results of a prototype connected-phoneme-based speech recognition system on the Cell Broadband Engine™ (Cell/B.E.). Automatic speech recognition decodes speech samples into plain text (other representations are possible) and must process samples at real-time rates. Fortunately, the computational tasks involved in this pipeline are highly data-parallel and can receive significant hardware acceleration from vector-streaming architectures such as the Cell/B.E. Identifying and exploiting these parallelism opportunities is challenging, but also critical to improving system performance. We observed, from our initial performance timings, that a single Cell/B.E. processor can recognize speech from thousands of simultaneous voice channels in real time--a channel density that is orders-of-magnitude greater than the capacity of existing software speech recognizers based on CPUs (central processing units). This result emphasizes the potential for Cell/B.E.-based speech recognition and will likely lead to the future development of production speech systems using Cell/B.E. clusters.

  3. Physics of Automatic Target Recognition

    CERN Document Server

    Sadjadi, Firooz

    2007-01-01

    Physics of Automatic Target Recognition addresses the fundamental physical bases of sensing and information extraction in the state-of-the-art automatic target recognition field. It explores both passive and active multispectral sensing, polarimetric diversity, complex signature exploitation, sensor and processing adaptation, transformation of electromagnetic and acoustic waves in their interactions with targets, background clutter, transmission media, and sensing elements. The general inverse scattering and advanced signal processing techniques and scientific evaluation methodologies used in this multidisciplinary field will be part of this exposition. The issues of modeling of target signatures in various spectral modalities, LADAR, IR, SAR, high resolution radar, acoustic, seismic, visible, hyperspectral, in diverse geometric aspects will be addressed. The methods for signal processing and classification will cover concepts such as sensor adaptive and artificial neural networks, time reversal filt...

  4. Pattern Recognition Methods and Features Selection for Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Pavol Partila

    2015-01-01

    Full Text Available The impact of the classification method and feature selection on speech emotion recognition accuracy is discussed in this paper. Selecting the correct parameters in combination with the classifier is an important part of reducing the computational complexity of the system. This step is necessary especially for systems that will be deployed in real-time applications. The reason for the development and improvement of speech emotion recognition systems is their wide usability in today's automatic voice-controlled systems. The Berlin database of emotional recordings was used in this experiment. The classification accuracy of artificial neural networks, k-nearest neighbours, and Gaussian mixture models is measured considering the selection of prosodic, spectral, and voice quality features. The purpose was to find an optimal combination of methods and group of features for stress detection in human speech. The research contribution lies in the design of the speech emotion recognition system due to its accuracy and efficiency.

  5. Personality in speech assessment and automatic classification

    CERN Document Server

    Polzehl, Tim

    2015-01-01

    This work combines interdisciplinary knowledge and experience from research fields of psychology, linguistics, audio-processing, machine learning, and computer science. The work systematically explores a novel research topic devoted to automated modeling of personality expression from speech. For this aim, it introduces a novel personality assessment questionnaire and presents the results of extensive labeling sessions to annotate the speech data with personality assessments. It provides estimates of the Big 5 personality traits, i.e. openness, conscientiousness, extroversion, agreeableness, and neuroticism. Based on a database built on the questionnaire, the book presents models to tell apart different personality types or classes from speech automatically.

  6. Speech Recognition in Natural Background Noise

    OpenAIRE

    Julien Meyer; Laure Dentel; Fanny Meunier

    2013-01-01

    In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listene...

  7. Man machine interface based on speech recognition

    International Nuclear Information System (INIS)

    This work reports the development of a Man Machine Interface based on speech recognition. The system must recognize spoken commands and execute the desired tasks without manual intervention by operators. The range of applications goes from the execution of commands in an industrial plant's control room to navigation and interaction in virtual environments. Results are reported for isolated word recognition, the isolated words corresponding to the spoken commands. In the pre-processing stage, relevant parameters are extracted from the speech signals using the cepstral analysis technique; these are used for isolated word recognition and correspond to the inputs of an artificial neural network that performs the recognition tasks. (author)

  8. Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition

    Directory of Open Access Journals (Sweden)

    Bagher BabaAli

    2009-01-01

    Full Text Available Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of the speech signal as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, while speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and the clean trained acoustic model. Nevertheless, correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach for solving this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This will incorporate important information from the statistical models of the recognition engine as feedback for tuning SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method can achieve significant improvement of recognition rates across a wide range of signal to noise ratios.
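
    For context, the classic single-band spectral subtraction baseline that such likelihood-maximizing multiband variants build on looks roughly as follows: a noise magnitude estimate from leading noise-only frames is subtracted from each frame's magnitude spectrum before resynthesis. The sketch below is that baseline only, with a synthetic noisy tone, not the paper's feedback-tuned method.

```python
# Single-band spectral subtraction sketch with a spectral floor.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(noisy, fs, noise_frames=10, floor=0.02):
    f, t, Z = stft(noisy, fs=fs, nperseg=256)
    mag, phase = np.abs(Z), np.angle(Z)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # avoid negatives
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=256)
    return enhanced

fs = 8000
rng = np.random.default_rng(5)
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
noisy = 0.3 * rng.standard_normal(fs + 2000)
noisy[2000:] += tone            # first 0.25 s is noise only
enhanced = spectral_subtraction(noisy, fs)
```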

  9. PCA-Based Speech Enhancement for Distorted Speech Recognition

    Directory of Open Access Journals (Sweden)

    Tetsuya Takiguchi

    2007-09-01

    Full Text Available We investigated a robust speech feature extraction method using kernel PCA (Principal Component Analysis) for distorted speech recognition. Kernel PCA has been suggested for various image processing tasks requiring an image model, such as denoising, where a noise-free image is constructed from a noisy input image. Much research for robust speech feature extraction has been done, but it remains difficult to completely remove additive or convolutional noise (distortion). The most commonly used noise-removal techniques are based on the spectral-domain operation, and then for speech recognition, the MFCC (Mel Frequency Cepstral Coefficient) is computed, where DCT (Discrete Cosine Transform) is applied to the mel-scale filter bank output. This paper describes a new PCA-based speech enhancement algorithm using kernel PCA instead of DCT, where the main speech element is projected onto low-order features, while the noise or distortion element is projected onto high-order features. Its effectiveness is confirmed by word recognition experiments on distorted speech.
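
    A minimal sketch of the kernel-PCA reconstruction idea follows: fit kernel PCA on clean feature vectors, then map noisy vectors onto the low-order components and back (the pre-image), which suppresses the distortion component. Synthetic vectors stand in for mel filter-bank features, and scikit-learn's KernelPCA replaces whatever implementation the paper used.

```python
# Kernel PCA denoising sketch: keep only low-order kernel components.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(6)
clean = rng.standard_normal((200, 24)) @ np.diag(np.linspace(2, 0.1, 24))
noisy = clean + 0.5 * rng.standard_normal(clean.shape)

kpca = KernelPCA(n_components=8, kernel="rbf", gamma=0.01,
                 fit_inverse_transform=True)     # learns a pre-image map
kpca.fit(clean)
denoised = kpca.inverse_transform(kpca.transform(noisy))
print("distortion before:", np.linalg.norm(noisy - clean))
print("distortion after: ", np.linalg.norm(denoised - clean))
```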

  10. Low SNR Speech Recognition using SMKL

    Directory of Open Access Journals (Sweden)

    Qin Yuan

    2014-05-01

    Full Text Available While traditional speech recognition methods have achieved great success in a number of real-world applications, their further application to some difficult situations, such as low Signal-to-Noise Ratio (SNR) signals and local languages, is still limited by their shortcomings in adaptation ability. In particular, their robustness to pronunciation-level noise is not satisfactory. To overcome these limitations, in this paper we propose a novel speech recognition approach for low signal-to-noise ratio signals. The general steps of our approach are signal preprocessing, feature extraction, and recognition with the simple multiple kernel learning (SMKL) method. Then the application of SMKL to speech recognition with low SNR is presented. We evaluate the proposed approach on a standard dataset. The experimental results show that the performance of the SMKL method for low-SNR speech recognition is significantly higher than that of methods based on other popular approaches. Further, the SMKL-based method can be straightforwardly applied to recognition problems with large-scale datasets, high-dimensional data, and a large amount of heterogeneous information.

  11. Deep Multimodal Learning for Audio-Visual Speech Recognition

    OpenAIRE

    Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava

    2015-01-01

    In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean conditions on the IBM large vocabulary audio-visual studio datase...

  12. Speech Clarity Index (Ψ): A Distance-Based Speech Quality Indicator and Recognition Rate Prediction for Dysarthric Speakers with Cerebral Palsy

    Science.gov (United States)

    Kayasith, Prakasith; Theeramunkong, Thanaruk

    It is a tedious and subjective task to measure the severity of dysarthria by manually evaluating a speaker's speech using available standard assessment methods based on human perception. This paper presents an automated approach to assess the speech quality of a dysarthric speaker with cerebral palsy. With the consideration of two complementary factors, speech consistency and speech distinction, a speech quality indicator called the speech clarity index (Ψ) is proposed as a measure of the speaker's ability to produce a consistent speech signal for a certain word and distinguishable speech signals for different words. As an application, it can be used to assess speech quality and forecast the speech recognition rate of speech made by an individual dysarthric speaker before actual exhaustive implementation of an automatic speech recognition system for the speaker. The effectiveness of Ψ as a speech recognition rate predictor is evaluated by rank-order inconsistency, correlation coefficient, and root-mean-square of difference. The evaluations were done by comparing its predicted recognition rates with those predicted by the standard methods, called the articulatory and intelligibility tests, based on two recognition systems (HMM and ANN). The results show that Ψ is a promising indicator for predicting the recognition rate of dysarthric speech. All experiments were done on a speech corpus composed of speech data from eight normal speakers and eight dysarthric speakers.

  13. Hidden neural networks: application to speech recognition

    DEFF Research Database (Denmark)

    Riis, Søren Kamaric

    1998-01-01

    We evaluate the hidden neural network HMM/NN hybrid on two speech recognition benchmark tasks: (1) task-independent isolated word recognition on the Phonebook database, and (2) recognition of broad phoneme classes in continuous speech from the TIMIT database. It is shown how hidden neural networks (HNNs) with far fewer parameters than conventional HMMs and other hybrids can obtain comparable performance, and for the broad class task it is illustrated how the HNN can be applied as a purely transition-based system, where acoustic context-dependent transition probabilities are estimated by neural networks...

  14. Dynamic Programming Algorithms in Speech Recognition

    Directory of Open Access Journals (Sweden)

    Titus Felix FURTUNA

    2008-01-01

    Full Text Available In a word-based speech recognition system, recognition requires the comparison between the input word signal and the various words of the dictionary. The problem can be solved efficiently by a dynamic comparison algorithm whose goal is to put the temporal scales of the two words in optimal correspondence. An algorithm of this type is Dynamic Time Warping. This paper presents two alternative implementations of the algorithm designed for recognition of isolated words.
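    The dynamic comparison is easy to make concrete. Below is a minimal quadratic-time DTW in Python; the frame distance, the template dictionary, and the function names are illustrative and not tied to either of the paper's two implementation variants.

    ```python
    import numpy as np

    def dtw_distance(a, b):
        # Dynamic Time Warping between two feature sequences (frames x dims).
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
                D[i, j] = cost + min(D[i - 1, j],       # insertion
                                     D[i, j - 1],       # deletion
                                     D[i - 1, j - 1])   # match
        return D[n, m]

    def recognize(utterance, templates):
        # Isolated-word recognition: pick the reference template whose
        # warped distance to the utterance is smallest.
        return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
    ```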

  15. Objective Gender and Age Recognition from Speech Sentences

    Directory of Open Access Journals (Sweden)

    Fatima K. Faek

    2015-10-01

    Full Text Available In this work, an automatic gender and age recognizer from speech is investigated. The features relevant to gender recognition are selected from the first four formant frequencies and twelve MFCCs and fed to an SVM classifier, while the features relevant to age are used with a k-NN classifier for the age recognizer model, using MATLAB as a simulation tool. A special selection of robust features is used in this work to improve the results of the gender and age classifiers, based on the frequency range that each feature represents. The gender and age classification algorithms are evaluated using 114 (clean and noisy) speech samples uttered in the Kurdish language. The two-class gender recognition model (adult males and adult females) reached 96% recognition accuracy, while for three-category classification (adult males, adult females, and children) the model achieved 94% recognition accuracy. For the age recognition model, seven groups are categorized according to their ages. The model performance after selecting the features relevant to age reached 75.3%. For further improvement, a de-noising technique is applied to the noisy speech signals, followed by selecting the features affected by the de-noising process, resulting in 81.44% recognition accuracy.
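    The classifier setup can be sketched in a few lines. The paper works in MATLAB; this Python version with random stand-in feature matrices is purely illustrative, and the feature dimensions, class counts, and k value are assumptions.

    ```python
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    # Stand-ins for per-utterance vectors of selected formant and MFCC features.
    rng = np.random.default_rng(1)
    X_gender, y_gender = rng.normal(size=(114, 16)), rng.integers(0, 2, 114)  # male/female
    X_age, y_age = rng.normal(size=(114, 16)), rng.integers(0, 7, 114)        # seven age groups

    gender_clf = SVC(kernel="rbf").fit(X_gender, y_gender)           # SVM for gender
    age_clf = KNeighborsClassifier(n_neighbors=5).fit(X_age, y_age)  # k-NN for age
    ```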

  16. Emotion recognition from speech: tools and challenges

    Science.gov (United States)

    Al-Talabani, Abdulbasit; Sellahewa, Harin; Jassim, Sabah A.

    2015-05-01

    Human emotion recognition from speech is studied frequently for its importance in many applications, e.g., human-computer interaction. There is wide diversity and disagreement about the basic emotions or emotion-related states on the one hand, and about where the emotion-related information lies in the speech signal on the other. These diversities motivate our investigations into extracting meta-features using the PCA approach or a non-adaptive random projection (RP), which significantly reduce the large-dimensional speech feature vectors that may contain a wide range of emotion-related information. Subsets of meta-features are fused to increase the performance of the recognition model, which adopts the score-based LDC classifier. We demonstrate that our scheme outperforms the state-of-the-art results when tested on non-prompted databases or acted databases (i.e., when subjects act specific emotions while uttering a sentence). However, the huge gap between accuracy rates achieved on the different types of speech datasets raises questions about the way emotions modulate speech. In particular, we argue that emotion recognition from speech should not be dealt with as a classification problem. We demonstrate the presence of a spectrum of different emotions in the same speech portion, especially in the non-prompted data sets, which tend to be more "natural" than the acted datasets, where the subjects attempt to suppress all but one emotion.

  17. Fifty years of progress in speech recognition

    Science.gov (United States)

    Reddy, Raj

    2004-10-01

    Human-level speech recognition has proved to be an elusive goal because of the many sources of variability that affect speech: from stationary and dynamic noise, microphone variability, and speaker variability to variability at phonetic, prosodic, and grammatical levels. Over the past 50 years, Jim Flanagan has been a continuous source of encouragement and inspiration to the speech recognition community. While early isolated word systems primarily used acoustic knowledge, systems in the 1970s found mechanisms to represent and utilize syntactic (e.g., information retrieval) and semantic knowledge (e.g., chess) in speech recognition systems. As vocabularies became larger, leading to greater ambiguity and perplexity, we had to explore the use of task-specific and context-specific knowledge to reduce the branching factors. As the need arose for systems that can be used by open populations using telephone-quality speech, we developed learning techniques that use very large data sets and noise adaptation methods. We still have a long way to go before we can satisfactorily handle unrehearsed spontaneous speech, speech from non-native speakers, and dynamic learning of new words, phrases, and grammatical forms.

  18. Can automatic speech transcripts be used for large scale TV stream description and structuring?

    OpenAIRE

    Guinaudeau, Camille; Gravier, Guillaume; Sébillot, Pascale

    2009-01-01

    The increasing quantity of TV material requires methods to help users navigate such data streams. Automatically associating a short textual description with each program in a stream is a first stage for navigating or structuring tasks. Speech contained in TV broadcasts--accessible by means of automatic speech recognition systems in the absence of closed captions--is a highly valuable semantic clue that might be used to link existing textual descriptions such as program...

  19. Testing for robust speech recognition performance

    Science.gov (United States)

    Simpson, C. A.; Moore, C. A.; Ruth, J. C.

    Results are reported from two studies which evaluated speaker-dependent connected-speech template-matching algorithms. One study examined the recognition performance for vocabularies spoken within a spacesuit. Two token vocabularies were used that were recorded in different noise levels. The second study evaluated the rejection accuracy for two commercial speech recognizers. The spoken test tokens were variations on a single word. The tests underscored the inferiority of speech recognizers relative to the human capability for discerning among phonetically different words. However, one commercial recognizer exhibited over 96-percent rejection accuracy in a noisy environment.

  20. Novel acoustic features for speech emotion recognition

    Institute of Scientific and Technical Information of China (English)

    ROH Yong-Wan; KIM Dong-Ju; LEE Woo-Seok; HONG Kwang-Seok

    2009-01-01

    This paper focuses on acoustic features that effectively improve the recognition of emotion in human speech. The novel features in this paper are based on spectral entropy parameters such as fast Fourier transform (FFT) spectral entropy, delta FFT spectral entropy, Mel-frequency filter bank (MFB) spectral entropy, and delta MFB spectral entropy. Spectral entropy features are simple: they reflect the frequency characteristics of speech and how those characteristics change over time. We implement an emotion rejection module using the probability distributions of recognized scores and rejected scores, which reduces the false recognition rate and improves overall performance. Recognized scores and rejected scores refer to the probabilities of recognized and rejected emotion recognition results, respectively. These scores are first obtained from a pattern recognition procedure. The pattern recognition phase uses a Gaussian mixture model (GMM). We classify four emotional states: anger, sadness, happiness, and neutrality. The proposed method is evaluated using 45 sentences per emotion for 30 subjects, 15 males and 15 females. Experimental results show that the proposed method is superior to existing emotion recognition methods based on GMMs using energy, zero crossing rate (ZCR), linear prediction coefficient (LPC), and pitch parameters, demonstrating the effectiveness of the proposed approach. One of the proposed features, combined MFB and delta MFB spectral entropy, improves performance by approximately 10% compared to the existing feature parameters for speech emotion recognition. We also demonstrate a 4% performance improvement from applying emotion rejection to results with low confidence scores.
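    The core feature is simple to state: treat a frame's normalized power spectrum as a probability distribution and take its entropy; differencing across frames gives the delta variant. The sketch below shows the FFT version, and the MFB variants would apply the same entropy to Mel filter-bank energies; the windowing and FFT size are arbitrary choices.

    ```python
    import numpy as np

    def fft_spectral_entropy(frame, n_fft=512, eps=1e-12):
        # Normalized power spectrum treated as a probability distribution.
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        p = spectrum / (spectrum.sum() + eps)
        return -np.sum(p * np.log2(p + eps))

    def delta(track):
        # Frame-to-frame change of any per-frame feature track.
        track = np.asarray(track)
        return np.diff(track, prepend=track[0])
    ```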

  1. Audio-Visual Speech Recognition Using MPEG-4 Compliant Visual Features

    Directory of Open Access Journals (Sweden)

    Aleksic Petar S

    2002-01-01

    Full Text Available We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments were performed using clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0–30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.

  2. Connected digit speech recognition system for Malayalam language

    Indian Academy of Sciences (India)

    Cini Kurian; Kannan Balakrishnan

    2013-12-01

    Connected digit speech recognition is important in many applications such as automated banking systems, catalogue dialing, and automatic data entry. This paper presents an optimal speaker-independent connected digit recognizer for the Malayalam language. The system employs Perceptual Linear Predictive (PLP) cepstral coefficients for speech parameterization and continuous density Hidden Markov Models (HMMs) in the recognition process. The Viterbi algorithm is used for decoding. The training database has utterances from 21 speakers in the age group of 20 to 40 years, recorded in a normal office environment, where each speaker was asked to read 20 sets of continuous digits. The system obtained an accuracy of 99.5% with unseen data.
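    Decoding with the Viterbi algorithm finds the most likely HMM state sequence for an utterance. Below is a generic log-domain sketch; the matrix names and shapes are our conventions, and in a system like the one above the per-frame observation log-likelihoods would come from the PLP/continuous-density HMM front end.

    ```python
    import numpy as np

    def viterbi(log_A, log_B, log_pi):
        # log_A: (S, S) log transition probabilities, log_pi: (S,) log initial
        # probabilities, log_B: (T, S) per-frame observation log-likelihoods.
        T, S = log_B.shape
        score = np.empty((T, S))
        back = np.zeros((T, S), dtype=int)
        score[0] = log_pi + log_B[0]
        for t in range(1, T):
            cand = score[t - 1][:, None] + log_A   # (from_state, to_state)
            back[t] = cand.argmax(axis=0)
            score[t] = cand.max(axis=0) + log_B[t]
        path = [int(score[-1].argmax())]
        for t in range(T - 1, 0, -1):              # trace best predecessors back
            path.append(int(back[t, path[-1]]))
        return path[::-1]
    ```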

  3. Phonological modeling for continuous speech recognition in Korean

    CERN Document Server

    Lee, W I; Lee, J H; Lee, WonIl; Lee, Geunbae; Lee, Jong-Hyeok

    1996-01-01

    A new scheme to represent phonological changes during continuous speech recognition is suggested. A phonological tag coupled with its morphological tag is designed to represent the conditions of Korean phonological changes. A pairwise language model of these morphological and phonological tags is implemented in a Korean speech recognition system. The performance of the model is verified through TDNN-based speech recognition experiments.

  4. Employment of Spectral Voicing Information for Speech and Speaker Recognition in Noisy Conditions

    OpenAIRE

    Jančovič, Peter; Köküer, Münevver

    2008-01-01

    This chapter describes our recent research on the representation and modelling of speech signals for automatic speech and speaker recognition in noisy conditions. The chapter consists of three parts. In the first part, we present a novel method for estimating the voicing information of speech spectra in the presence of noise. The presented method is based on calculating a similarity between the shape of the signal short-term spectrum and the spectrum of the frame-analysis window. It does not re...

  5. An RBFN-based system for speaker-independent speech recognition

    OpenAIRE

    Huliehel, Fakhralden A.

    1995-01-01

    A speaker-independent isolated-word small vocabulary system is developed for applications such as voice-driven menu systems. The design of a cascade of recognition layers is presented. Several feature sets are compared. Phone recognition is performed using a radial basis function network (RBFN). Dynamic time warping (DTW) is used for word recognition. The TIMIT database is used to design and test the automatic speech recognition (ASR) system. Several feature sets using mel-s...

  6. Speech Recognition: A World of Opportunities

    Science.gov (United States)

    PACER Center, 2004

    2004-01-01

    Speech recognition technology helps people with disabilities interact with computers more easily. People with motor limitations, who cannot use a standard keyboard and mouse, can use their voices to navigate the computer and create documents. The technology is also useful to people with learning disabilities who experience difficulty with spelling…

  7. Bimodal Emotion Recognition from Speech and Text

    Directory of Open Access Journals (Sweden)

    Weilin Ye

    2014-01-01

    Full Text Available This paper presents an approach to emotion recognition from speech signals and textual content. In the analysis of speech signals, thirty-seven acoustic features are extracted from the speech input. Two different classifiers, Support Vector Machines (SVMs) and a BP neural network, are adopted to classify the emotional states. In text analysis, we use a two-step classification method to recognize the emotional states. The final emotional state is determined based on the emotion outputs from the acoustic and textual analyses. In this paper we have two parallel classifiers for acoustic information and two serial classifiers for textual information, and a final decision is made by combining these classifiers in decision-level fusion. Experimental results show that the emotion recognition accuracy of the integrated system is better than that of either of the two individual approaches.

  8. Human and automatic speaker recognition over telecommunication channels

    CERN Document Server

    Fernández Gallardo, Laura

    2016-01-01

    This work addresses the evaluation of the human and the automatic speaker recognition performances under different channel distortions caused by bandwidth limitation, codecs, and electro-acoustic user interfaces, among other impairments. Its main contribution is the demonstration of the benefits of communication channels of extended bandwidth, together with an insight into how speaker-specific characteristics of speech are preserved through different transmissions. It provides sufficient motivation for considering speaker recognition as a criterion for the migration from narrowband to enhanced bandwidths, such as wideband and super-wideband.

  9. Objects Control through Speech Recognition Using LabVIEW

    OpenAIRE

    Ankush Sharma; Srinivas Perala; Priya Darshni

    2013-01-01

    Speech is the natural form of human communication, and speech processing is one of the most stimulating areas of signal processing. Speech recognition technology has made it possible for computers to follow human voice commands and understand human languages. In this paper, control of objects (LEDs, toggle switches, etc.) through human speech is designed by combining virtual instrumentation technology and speech recognition techniques; password authentication is also provided....

  11. Merge-Weighted Dynamic Time Warping for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    张湘莉兰; 骆志刚; 李明

    2014-01-01

    Obtaining training material for rarely used English words and common given names from countries where English is not spoken is difficult due to excessive time, storage, and cost factors. Considering personal privacy, language-independent (LI) recognition with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a convenient option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR in real-time applications with limited storage and small vocabularies. These applications include voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. However, traditional DTW has several limitations, such as high computational complexity, coarse approximation induced by its constraints, and inaccuracy problems. In this paper, we introduce the merge-weighted dynamic time warping (MWDTW) algorithm. This method defines a template confidence index for measuring the similarity between merged training data and testing data, while following the core DTW process. MWDTW is simple, efficient, and easy to implement. With extensive experiments on three representative SD speech recognition datasets, we demonstrate that our method significantly outperforms DTW, DTW on merged speech data, and the hidden Markov model (HMM), and is also six times faster than DTW overall.

  12. A Dialectal Chinese Speech Recognition Framework

    Institute of Scientific and Technical Information of China (English)

    Jing Li; Thomas Fang Zheng; William Byrne; Dan Jurafsky

    2006-01-01

    A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (in other words, Chinese influenced by the speaker's native dialect) speech corpus and dialect-related knowledge are adopted to transform a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. Two kinds of knowledge sources are explored: one is expert knowledge and the other is a small dialectal Chinese corpus. These knowledge sources provide information at four levels: the phonetic level, lexicon level, language level, and acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language. The goal is to establish a WDC speech recognizer from an existing PTH speech recognizer, based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua. The authors propose to use context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), context-independent WDC-IF mappings, and syllable-dependent WDC-IF mappings (obtained from either experts or data), and combine them with the supervised maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might also increase lexicon confusion and hence degrade performance, a Multi-Pronunciation Expansion (MPE) method based on the accumulated uni-gram probability (AUP) is proposed. In addition, some commonly used WDC words are selected and added to the lexicon. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieves 10-18% absolute Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% CER increase when recognizing PTH. The proposed framework and methods are expected to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and

  13. Arabic Speech Recognition System using CMU-Sphinx4

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper we present the creation of an Arabic version of an Automated Speech Recognition System (ASR). The system is based on the open source Sphinx-4 from Carnegie Mellon University, a speech recognition system based on discrete hidden Markov models (HMMs). We investigate the changes that must be made to the model to adapt it to Arabic voice recognition. Keywords: speech recognition, acoustic model, Arabic language, HMMs, CMU Sphinx-4, artificial intelligence.

  14. Speech Recognition Technology for Hearing Disabled Community

    Directory of Open Access Journals (Sweden)

    Tanvi Dua

    2014-09-01

    Full Text Available As the number of people with hearing disabilities is increasing significantly in the world, technology is always needed to fill the communication gap between the Deaf and hearing communities. To fill this gap and allow people with hearing disabilities to communicate, this paper suggests a framework that contributes to the efficient integration of people with hearing disabilities. It presents a robust speech recognition system which converts continuous speech into text and images. Results were obtained with an accuracy of 95% on a small vocabulary of 20 greeting sentences of continuous speech, tested in speaker-independent mode. In this testing phase, all these continuous sentences were given as live input to the proposed system.

  15. Speech emotion recognition with unsupervised feature learning

    Institute of Scientific and Technical Information of China (English)

    Zheng-wei HUANG; Wen-tao XUE; Qi-rong MAO

    2015-01-01

    Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.

  16. Self Organizing Markov Map for Speech and Gesture Recognition

    Directory of Open Access Journals (Sweden)

    Ms. Nutan D. Sonwane, Prof. S.A. Chhabria, Dr. R.V. Dharaskar

    2012-04-01

    Full Text Available Gesture- and speech-based human-computer interaction is attracting attention across various areas such as pattern recognition and computer vision, and finds many applications in multimodal HCI, robotics control, and sign language recognition. This paper presents a head and hand gesture as well as speech recognition system for human-computer interaction (HCI). Such a vision-based system can demonstrate the capability of a computer to understand and respond to hand and head gestures, as well as to speech in the form of sentences. The recognition system consists of two main modules: (1) gesture recognition, comprising image capturing, gesture feature extraction, and gesture modeling (direction, position, generalized); and (2) speech recognition, comprising voice signal acquisition, spectral coding, unit matching (best matching unit, BMU), lexical decoding, and syntactic and semantic analysis. Compared with many existing algorithms for gesture and speech recognition, the SOM provides flexibility and robustness in noisy environments. The detection of gestures is based on discrete predefined symbol sets, which are manually labeled during the training phase. The gesture-speech correlation is modelled by examining co-occurring speech and gesture patterns. This correlation can be used to fuse the gesture and speech modalities for edutainment applications (e.g., video games and 3-D animations, where natural gestures of talking avatars are animated from speech). A speech-driven gesture animation example has been implemented for demonstration.

  17. Hidden Conditional Neural Fields for Continuous Phoneme Speech Recognition

    Science.gov (United States)

    Fujii, Yasuhisa; Yamamoto, Kazumasa; Nakagawa, Seiichi

    In this paper, we propose Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition. HCNF are a combination of Hidden Conditional Random Fields (HCRF) and a Multi-Layer Perceptron (MLP), and inherit their merits, namely, the discriminative property for sequences from HCRF and the ability to extract non-linear features from an MLP. HCNF can incorporate many types of features from which non-linear features can be extracted, and is trained by sequential criteria. We first present the formulation of HCNF and then examine three methods to further improve automatic speech recognition using HCNF: an objective function that explicitly considers training errors, a hierarchical tandem-style feature, and a deep non-linear feature extractor for the observation function. We show that HCNF can be trained realistically without any initial model and outperforms HCRF and a triphone hidden Markov model trained in the minimum phone error (MPE) manner, using experimental results for continuous English phoneme recognition on the TIMIT core test set and Japanese phoneme recognition on the IPA 100 test set.

  18. Development an Automatic Speech to Facial Animation Conversion for Improve Deaf Lives

    Directory of Open Access Journals (Sweden)

    S. Hamidreza Kasaei

    2011-05-01

    Full Text Available In this paper, we propose the design and initial implementation of a robust system which automatically translates voice into text and text into sign language animations. Sign language translation systems could significantly improve deaf lives, especially in communications and the exchange of information, much as machine translation of conversations from one language to another has. Considering these points, it seems necessary to study speech recognition. Usually, voice recognition algorithms address three major challenges: the first is extracting features from speech, the second is recognition when only a limited sound gallery is available, and the final challenge is to progress from speaker-dependent to speaker-independent voice recognition. Extracting features from speech is an important stage in our method. Different procedures are available for extracting features from speech; one of the commonest used in speech recognition systems is Mel-Frequency Cepstral Coefficients (MFCCs). The algorithm starts with preprocessing and signal conditioning. Next, features are extracted from speech using cepstral coefficients. The result of this process is then sent to the segmentation part. Finally, the recognition part recognizes the words and converts the recognized words to facial animation. The project is still in progress and some new interesting methods are described in the current report.
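    As an illustration of the MFCC front end described above, the following sketch uses librosa with common default choices (16 kHz sampling, 25 ms frames, 10 ms hop, 13 coefficients). The file name and the parameter values are assumptions, not the paper's configuration.

    ```python
    import numpy as np
    import librosa

    # Hypothetical input file; the paper's recordings are not public.
    signal, sr = librosa.load("speech.wav", sr=16000)

    # Pre-emphasis, a typical signal-conditioning step.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 13 MFCCs per 25 ms frame with a 10 ms hop.
    mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    print(mfcc.shape)  # (13, number_of_frames)
    ```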

  19. Support vector machine for automatic pain recognition

    Science.gov (United States)

    Monwar, Md Maruf; Rezaei, Siamak

    2009-02-01

    Facial expressions are a key index of emotion, and the interpretation of such expressions of emotion is critical to everyday social functioning. In this paper, we present an efficient video analysis technique for recognition of a specific expression, pain, from human faces. We employ an automatic face detector which detects faces from stored video frames using a skin color modeling technique. For pain recognition, location and shape features of the detected faces are computed. These features are then used as inputs to a support vector machine (SVM) for classification. We compare the results with neural network based and eigenimage based automatic pain recognition systems. The experimental results indicate that using a support vector machine as the classifier can certainly improve the performance of an automatic pain recognition system.

  20. Compact Acoustic Models for Embedded Speech Recognition

    Directory of Open Access Journals (Sweden)

    Christophe Lévy

    2009-01-01

    Full Text Available Speech recognition applications are known to require a significant amount of resources. However, embedded speech recognition only allows a few KB of memory, a few MIPS, and a small amount of training data. In order to fit the resource constraints of embedded applications, an approach based on a semicontinuous HMM system using state-independent acoustic modelling is proposed. A transformation is computed and applied to the global model in order to obtain each HMM state-dependent probability density function, so that only the transformation parameters need to be stored. This approach is evaluated on two tasks: digit and voice-command recognition. A fast adaptation technique for acoustic models is also proposed. In order to significantly reduce computational costs, the adaptation is performed only on the global model (using related speaker recognition adaptation techniques), with no need for state-dependent data. The whole approach results in a relative gain of more than 20% compared to a basic HMM-based system fitting the constraints.

  1. Cross-Word Modeling for Arabic Speech Recognition

    CERN Document Server

    AbuZeina, Dia

    2012-01-01

    "Cross-Word Modeling for Arabic Speech Recognition" utilizes phonological rules in order to model the cross-word problem, a merging of adjacent words in speech caused by continuous speech, to enhance the performance of continuous speech recognition systems. The author aims to provide an understanding of the cross-word problem and how it can be avoided, specifically focusing on Arabic phonology using an HHM-based classifier.

  2. Joint speech and speaker recognition using neural networks

    OpenAIRE

    Xue, Xiaoguo

    2013-01-01

    Speech is the main communication method between human beings. Since the invention of the computer, people have been trying to make computers understand natural speech. Speech recognition is a technology with close connections to computer science, signal processing, voice linguistics, and intelligent systems. It has been a "hot" subject not only in the field of research but also as a practical application. Especially in real life, speaker and speech recognition have been use...

  3. Automatic recognition of element classes and boundaries in the birdsong with variable sequences

    OpenAIRE

    Koumura, Takuya; Okanoya, Kazuo

    2016-01-01

    Research on sequential vocalization often requires the analysis of vocalizations in long continuous sounds. In studies such as developmental ones, or studies across generations in which days or months of vocalizations must be analyzed, methods for automatic recognition are strongly desired. Although methods for automatic speech recognition for application purposes have been intensively studied, blindly applying them for biological purposes may not be an optimal solution. This is because, unl...

  4. Performance of current models of speech recognition and resulting challenges

    OpenAIRE

    Schubotz, Wiebke

    2015-01-01

    Speech is usually perceived in background noise (a masker) that can severely hamper its recognition. Nevertheless, there are mechanisms that enable speech recognition even in difficult listening conditions. Some of them, such as the combination of across-frequency information or binaural cues, are studied in this dissertation. Moreover, masking aspects such as energetic, amplitude modulation, or informational masking are considered. Speech recognition in complex maskers is investigated tha...

  5. Better speech recognition with cochlear implants

    Science.gov (United States)

    Wilson, Blake S.; Finley, Charles C.; Lawson, Dewey T.; Wolford, Robert D.; Eddington, Donald K.; Rabinowitz, William M.

    1991-07-01

    High levels of speech recognition have been achieved with a new sound processing strategy for multielectrode cochlear implants. A cochlear implant system consists of one or more implanted electrodes for direct electrical activation of the auditory nerve, an external speech processor that transforms a microphone input into stimuli for each electrode, and a transcutaneous (rf-link) or percutaneous (direct) connection between the processor and the electrodes. We report here the comparison of the new strategy and a standard clinical processor. The standard compressed analogue (CA) processor presented analogue waveforms simultaneously to all electrodes, whereas the new continuous interleaved sampling (CIS) strategy presented brief pulses to each electrode in a nonoverlapping sequence. Seven experienced implant users, selected for their excellent performance with the CA processor, participated as subjects. The new strategy produced large improvements in the scores of speech reception tests for all subjects. These results have important implications for the treatment of deafness and for minimal representations of speech at the auditory periphery.

  6. An Improved Hindi Speech Emotion Recognition System

    Directory of Open Access Journals (Sweden)

    Agnes Jacob

    2013-11-01

    Full Text Available This paper presents the results of investigations into speech emotion recognition in Hindi, using only the first four formants and their bandwidths. This research work was done on a female speech database of nearly 1600 utterances comprising neutral, happiness, surprise, anger, sadness, fear, and disgust as the elicited emotions. The best of the statistically preprocessed formant and bandwidth features were first identified by K-Means, K-Nearest Neighbour, and Naive Bayes classification of individual features. This was followed by artificial neural network classification based on the combination of the best formants and bandwidths. The highest overall emotion recognition accuracy obtained by the ANN method was 97.14%, based on the first four formants and their bandwidths. A striking increase in recognition accuracy was observed when the number of emotion classes was reduced from seven. The results presented in this paper have not been reported so far for Hindi using the proposed spectral features together with the adopted preprocessing and classification methods.
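    The paper does not state its formant extraction procedure, so the sketch below uses the common LPC root-finding method: pole angles give formant frequencies and pole radii give bandwidths. The LPC order, pre-emphasis coefficient, and 90 Hz floor are assumed values.

    ```python
    import numpy as np
    import scipy.signal
    import librosa

    def formants_and_bandwidths(frame, sr, n=4, order=12):
        # frame: one short (~30 ms) voiced frame of speech at sample rate sr.
        x = scipy.signal.lfilter([1.0, -0.97], [1.0], frame) * np.hamming(len(frame))
        a = librosa.lpc(x.astype(np.float64), order=order)
        roots = [r for r in np.roots(a) if np.imag(r) >= 0.01]  # upper half-plane poles
        freqs = np.angle(roots) * sr / (2.0 * np.pi)            # pole angle -> frequency
        bws = -np.log(np.abs(roots)) * sr / np.pi               # pole radius -> bandwidth
        pairs = sorted((f, b) for f, b in zip(freqs, bws) if f > 90.0)
        return pairs[:n]  # [(F1, B1), ..., (Fn, Bn)]
    ```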

  7. Speech Recognition Technology Applied to Intelligent Mobile Navigation System

    Institute of Scientific and Technical Information of China (English)

    2002-01-01

    The capability of human-computer interaction reflects the intelligence of a mobile navigation system. In this paper, the navigation data and functions of a mobile navigation system are divided into system commands and non-system commands, and a group of speech commands is then abstracted. This paper applies speech recognition technology to an intelligent mobile navigation system to process speech commands, and carries out deeper research on the integration of speech recognition technology with a mobile navigation system. Navigation operations can be performed by speech commands, which makes human-computer interaction easy during navigation. The speech command interface of the navigation system is implemented with Dutty++ software, which is based on IBM's speech recognition system ViaVoice. Navigation experiments showed that navigation can be done almost without the keyboard, which proved that human-computer interaction via speech commands is very convenient and the reliability is also high.

  8. Automatic modulation recognition of communication signals

    CERN Document Server

    Azzouz, Elsayed Elsayed

    1996-01-01

    Automatic modulation recognition is a rapidly evolving area of signal analysis. In recent years, interest from the academic and military research institutes has focused around the research and development of modulation recognition algorithms. Any communication intelligence (COMINT) system comprises three main blocks: receiver front-end, modulation recogniser and output stage. Considerable work has been done in the area of receiver front-ends. The work at the output stage is concerned with information extraction, recording and exploitation and begins with signal demodulation, that requires accurate knowledge about the signal modulation type. There are, however, two main reasons for knowing the current modulation type of a signal; to preserve the signal information content and to decide upon the suitable counter action, such as jamming. Automatic Modulation Recognition of Communications Signals describes in depth this modulation recognition process. Drawing on several years of research, the authors provide a cr...

  9. A pattern recognition based esophageal speech enhancement system

    Directory of Open Access Journals (Sweden)

    A. Mantilla-Caeiros

    2010-04-01

    Full Text Available A system for improving the intelligibility and quality of alaryngeal speech, based on the replacement of voiced segments of alaryngeal speech with the equivalent segments of normal speech, is proposed. To this end, the proposed system identifies the voiced segments of the alaryngeal speech signal by using isolated speech recognition methods, and replaces them with their equivalent voiced segments of normal speech, keeping the silence and unvoiced segments unchanged. Evaluation results using objective and subjective evaluation methods show that the proposed system provides a fairly good improvement in the quality and intelligibility of alaryngeal speech signals.

  10. Speech-based recognition of self-reported and observed emotion in a dimensional space

    NARCIS (Netherlands)

    Truong, Khiet P.; Leeuwen, van David A.; Jong, de Franciska M.G.

    2012-01-01

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two t

  11. Speech and audio processing for coding, enhancement and recognition

    CERN Document Server

    Togneri, Roberto; Narasimha, Madihally

    2015-01-01

    This book describes the basic principles underlying the generation, coding, transmission and enhancement of speech and audio signals, including advanced statistical and machine learning techniques for speech and speaker recognition, with an overview of the key innovations in these areas. Key research undertaken in speech coding, speech enhancement, speech recognition, emotion recognition and speaker diarization is also presented, along with recent advances and new paradigms in these areas. · Offers readers a single-source reference on the significant applications of speech and audio processing to speech coding, speech enhancement and speech/speaker recognition. Enables readers involved in algorithm development and implementation issues for speech coding to understand the historical development and future challenges in speech coding research; · Discusses speech coding methods yielding bit-streams that are multi-rate and scalable for Voice-over-IP (VoIP) networks; ...

  12. How does real affect affect affect recognition in speech?

    NARCIS (Netherlands)

    Truong, Khiet Phuong

    2009-01-01

    The aim of the research described in this thesis was to develop speech-based affect recognition systems that can deal with spontaneous (‘real’) affect instead of acted affect. Several affect recognition experiments with spontaneous affective speech data were carried out to investigate what combinati

  13. Building DNN Acoustic Models for Large Vocabulary Speech Recognition

    OpenAIRE

    Maas, Andrew L.; Qi, Peng; Xie, Ziang; Hannun, Awni Y.; Lengerich, Christopher T.; Jurafsky, Daniel; Ng, Andrew Y.

    2014-01-01

    Deep neural networks (DNNs) are now a central component of nearly all state-of-the-art speech recognition systems. Building neural network acoustic models requires several design decisions including network architecture, size, and training loss function. This paper offers an empirical investigation on which aspects of DNN acoustic model design are most important for speech recognition system performance. We report DNN classifier performance and final speech recognizer word error rates, and co...

  14. Introduction to Arabic Speech Recognition Using CMUSphinx System

    CERN Document Server

    Satori, H; Chenfour, N

    2007-01-01

    In this paper, Arabic is investigated from the speech recognition point of view. We propose a novel approach to build an Arabic Automated Speech Recognition System (ASR). This system is based on the open source CMU Sphinx-4 from Carnegie Mellon University. CMU Sphinx is a large-vocabulary, speaker-independent, continuous speech recognition system based on discrete Hidden Markov Models (HMMs). We build a model using utilities from the open source CMU Sphinx and demonstrate the possible adaptability of this system to Arabic voice recognition.

  15. Improving speech recognition on a mobile robot platform through the use of top-down visual queues

    OpenAIRE

    Ross, Robert; O'Donoghue, R. P. S.; O'Hare, G. M. P.

    2003-01-01

    In many real-world environments, Automatic Speech Recognition (ASR) technologies fail to provide adequate performance for applications such as human robot dialog. Despite substantial evidence that speech recognition in humans is performed in a top-down as well as bottom-up manner, ASR systems typically fail to capitalize on this, instead relying on a purely statistical, bottom up methodology. In this paper we advocate the use of a knowledge based approach to improving ASR in domains such as m...

  16. Automatic Gait Recognition by Symmetry Analysis

    OpenAIRE

    Hayfron-Acquah, James B.; Nixon, Mark S.; Carter, John N.

    2001-01-01

    We describe a new method for automatic gait recognition based on analysing the symmetry of human motion, by using the Generalised Symmetry Operator. This operator, rather than relying on the borders of a shape or on general appearance, locates features by their symmetrical properties. This approach is reinforced by the psychologists' view that gait is a symmetrical pattern of motion and by other works. We applied our new method to two different databases and derived gait signatures for silhou...

  17. Speech recognition algorithms based on weighted finite-state transducers

    CERN Document Server

    Hori, Takaaki

    2013-01-01

    This book introduces the theory, algorithms, and implementation techniques for efficient decoding in speech recognition mainly focusing on the Weighted Finite-State Transducer (WFST) approach. The decoding process for speech recognition is viewed as a search problem whose goal is to find a sequence of words that best matches an input speech signal. Since this process becomes computationally more expensive as the system vocabulary size increases, research has long been devoted to reducing the computational cost. Recently, the WFST approach has become an important state-of-the-art speech recogni

  18. An articulatorily constrained, maximum entropy approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-12-31

    Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition. One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. This makes HMMs better able to deal with intra- and inter-speaker variability despite the limited knowledge of how speech signals vary and despite the often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using the limited knowledge of speech, recognition performance decreases. However, the structure of an HMM has little in common with the mechanisms underlying speech production. Here, the author argues that by using probabilistic models that more accurately embody the process of speech production, he can create models that have all the advantages of HMMs, but that should more accurately capture the statistical properties of real speech samples--presumably leading to more accurate speech recognition. The model he will discuss uses the fact that speech articulators move smoothly and continuously. Before discussing how to use articulatory constraints, he will give a brief description of HMMs. This will allow him to highlight the similarities and differences between HMMs and the proposed technique.

  19. Wavelet Cesptral Coefficients for Isolated Speech Recognition

    Directory of Open Access Journals (Sweden)

    T. B. Adam

    2013-05-01

    Full Text Available The study proposes an improved feature extraction method called Wavelet Cepstral Coefficients (WCC). In traditional cepstral analysis, the cepstrum is calculated with the Discrete Fourier Transform (DFT). Since the DFT calculation assumes the signal is stationary within each frame, which in practice is not quite true, the WCC replaces the DFT block in the traditional cepstrum calculation with the Discrete Wavelet Transform (DWT), hence producing the WCCs. To evaluate the proposed WCC, a speech recognition task of recognizing the 26 English alphabet letters was conducted, with comparisons against the traditional Mel-Frequency Cepstral Coefficients (MFCCs) to further analyze the effectiveness of the WCCs. The WCCs show comparable results to the MFCCs, which is notable given the WCCs' small vector dimension. The best recognition was found with WCCs at level 5 of the DWT decomposition, with small differences of 1.19% and 3.21% relative to the MFCCs for speaker-independent and speaker-dependent tasks, respectively.
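    A rough reading of the idea, sketched below: take the DWT of a windowed frame in place of the DFT, then log-compress and decorrelate with a DCT to obtain cepstrum-like coefficients. The wavelet family, the use of concatenated subband coefficients, and the DCT step are our assumptions; the paper only specifies that the DWT replaces the DFT block.

    ```python
    import numpy as np
    import pywt
    from scipy.fftpack import dct

    def wcc(frame, wavelet="db4", level=5, n_coeffs=13, eps=1e-10):
        # The DWT stands in for the DFT of classical cepstral analysis.
        coeffs = pywt.wavedec(frame * np.hamming(len(frame)), wavelet, level=level)
        subbands = np.concatenate(coeffs)              # all subband coefficients
        return dct(np.log(np.abs(subbands) + eps), norm="ortho")[:n_coeffs]
    ```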

  20. Automatic Arabic Hand Written Text Recognition System

    Directory of Open Access Journals (Sweden)

    I. A. Jannoud

    2007-01-01

    Full Text Available Despite the considerable development of pattern recognition applications in the last decade of the twentieth century and in this century, text recognition remains one of the most important problems in pattern recognition. To the best of our knowledge, little work has been done in the area of Arabic text recognition compared with that for Latin, Chinese, and Japanese text. The main difficulty encountered when dealing with Arabic text is the cursive nature of Arabic writing in both printed and handwritten forms. An Automatic Arabic Hand-Written Text Recognition (AHTR) System is proposed. An efficient segmentation stage is required in order to divide a cursive word or sub-word into its constituent characters. After a word has been extracted from the scanned image, it is thinned and its baseline is calculated by analysis of the horizontal density histogram. The pattern is then followed along the baseline and the segmentation points are detected. After the segmentation stage, the cursive word is thus represented by a sequence of isolated characters, and the recognition problem reduces to that of classifying each character. A set of features is extracted from each individual character, and a minimum distance classifier is used. Some approaches for processing the characters and post-processing are added to enhance the results. Recognized characters are appended directly to a word file in editable form.

  1. Unification of automatic target tracking and automatic target recognition

    Science.gov (United States)

    Schachter, Bruce J.

    2014-06-01

    The subject being addressed is how an automatic target tracker (ATT) and an automatic target recognizer (ATR) can be fused together so tightly and so well that their distinctiveness becomes lost in the merger. This has historically not been the case outside of biology and a few academic papers. The biological model of ATT∪ATR arises from dynamic patterns of activity distributed across many neural circuits and structures (including retina). The information that the brain receives from the eyes is "old news" at the time that it receives it. The eyes and brain forecast a tracked object's future position, rather than relying on received retinal position. Anticipation of the next moment - building up a consistent perception - is accomplished under difficult conditions: motion (eyes, head, body, scene background, target) and processing limitations (neural noise, delays, eye jitter, distractions). Not only does the human vision system surmount these problems, but it has innate mechanisms to exploit motion in support of target detection and classification. Biological vision doesn't normally operate on snapshots. Feature extraction, detection and recognition are spatiotemporal. When vision is viewed as a spatiotemporal process, target detection, recognition, tracking, event detection and activity recognition do not seem as distinct as they are in current ATT and ATR designs. They appear as similar mechanisms taking place at varying time scales. A framework is provided for unifying ATT and ATR.

  2. Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

    Science.gov (United States)

    Xiao, Xiong; Zhao, Shengkui; Ha Nguyen, Duc Hoang; Zhong, Xionghu; Jones, Douglas L.; Chng, Eng Siong; Li, Haizhou

    2016-01-01

    This paper investigates deep neural networks (DNN) based on nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained from a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint helps to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics, while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.

  3. Emotion Recognition from Speech Signals and Perception of Music

    OpenAIRE

    Fernandez Pradier, Melanie

    2011-01-01

    This thesis deals with emotion recognition from speech signals. The feature extraction step shall be improved by looking at the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to invoke different feelings in listeners. The question is whether there is a similar mechanism between perception of music and perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at segment...

  4. Effects of Speech Clarity on Recognition Memory for Spoken Sentences

    OpenAIRE

    Van Engen, Kristin J.; Bharath Chandrasekaran; Rajka Smiljanic

    2012-01-01

    Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e. changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communi...

  5. SPEECH EMOTION RECOGNITION USING MODIFIED QUADRATIC DISCRIMINATION FUNCTION

    Institute of Scientific and Technical Information of China (English)

    2008-01-01

    The Quadratic Discrimination Function (QDF) is commonly used in speech emotion recognition, and proceeds on the premise that the input data are normally distributed. In this paper, we propose a transformation to normalize the emotional features, and then derive a Modified QDF (MQDF) for speech emotion recognition. Features based on prosody and voice quality are extracted, and a Principal Component Analysis Neural Network (PCANN) is used to reduce the dimension of the feature vectors. The results show that voice quality features are an effective supplement for recognition, and that the method in this paper improves the recognition rate effectively.
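    In scikit-learn terms the pipeline can be approximated as below: a Yeo-Johnson power transform stands in for the paper's normalizing transformation, PCA stands in for the PCANN reduction, and quadratic discriminant analysis plays the role of the QDF. All three substitutions are assumptions made for illustration.

    ```python
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PowerTransformer
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    # QDF assumes Gaussian inputs, so skewed prosodic and voice-quality
    # features are normalized before the quadratic classifier is fit.
    model = make_pipeline(PowerTransformer(),
                          PCA(n_components=10),
                          QuadraticDiscriminantAnalysis())
    # X: utterance-level prosody + voice-quality features, y: emotion labels.
    # model.fit(X_train, y_train); model.predict(X_test)
    ```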

  6. Source Separation via Spectral Masking for Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Gustavo Fernandes Rodrigues

    2012-12-01

    Full Text Available In this paper we present an insight into the use of spectral masking techniques in the time-frequency domain as a preprocessing step for speech signal recognition. Speech recognition systems have their performance negatively affected in noisy environments or in the presence of other speech signals. The limits of these masking techniques for different levels of the signal-to-noise ratio are discussed. We show the robustness of the spectral masking techniques against four types of noise: white, pink, brown and human speech noise (babble noise). The main contribution of this work is to analyze the performance limits of recognition systems using spectral masking. We obtain an increase of 18% in the speech hit rate when the speech signals were corrupted by other speech signals or babble noise, at signal-to-noise ratios of approximately 1, 10 and 20 dB. On the other hand, applying the ideal binary masks to mixtures corrupted by white, pink and brown noise results in an average increase of 9% in the speech hit rate at the same signal-to-noise ratios. The experimental results suggest that the spectral masking techniques are more suitable for the case of babble noise, which is produced by human speech, than for white, pink and brown noise.
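
    A minimal sketch of the ideal-binary-mask idea evaluated above, assuming separate access to the clean speech and noise signals (which the "ideal" mask requires); the STFT parameters and the 0 dB local SNR criterion are illustrative choices, not the paper's:

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask_enhance(clean, noise, fs=16000, lc_db=0.0, nperseg=512):
    """Build the ideal binary mask from clean/noise signals of equal
    length, apply it to their mixture, and resynthesize the result."""
    _, _, S = stft(clean, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    _, _, Y = stft(clean + noise, fs, nperseg=nperseg)
    local_snr = 10 * np.log10((np.abs(S) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
    mask = (local_snr > lc_db).astype(float)   # keep cells where speech dominates
    _, enhanced = istft(Y * mask, fs, nperseg=nperseg)
    return enhanced
```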

  7. Speech recognition for 40 patients receiving multichannel cochlear implants.

    Science.gov (United States)

    Dowell, R C; Mecklenburg, D J; Clark, G M

    1986-10-01

    We collected data on 40 patients who received the Nucleus multichannel cochlear implant. Results were reviewed to determine if the coding strategy is effective in transmitting the intended speech features and to assess patient benefit in terms of communication skills. All patients demonstrated significant improvement over preoperative results with a hearing aid for both lipreading enhancement and speech recognition without lipreading. Of the patients, 50% demonstrated ability to understand connected discourse with auditory input only. For the 23 patients who were tested 12 months postoperatively, there was substantial improvement in open-set speech recognition. PMID:3755975

  8. Speech Recognition Oriented Vowel Classification Using Temporal Radial Basis Functions

    CERN Document Server

    Guezouri, Mustapha; Benyettou, Abdelkader

    2009-01-01

    The recent resurgence of interest in spatio-temporal neural networks as a speech recognition tool motivates the present investigation. In this paper an approach was developed based on temporal radial basis functions (TRBF), offering several advantages: few parameters, fast convergence and time invariance. The application aims to identify vowels taken from natural speech samples from the TIMIT corpus of American speech. We report a recognition accuracy of 98.06 percent in training and 90.13 percent in testing on a subset of 6 vowel phonemes, with the possibility of expanding the vowel set in future.

  9. Automatic discrimination between laughter and speech

    NARCIS (Netherlands)

    Truong, K.; Leeuwen, D. van

    2007-01-01

    Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker’s state and emotion can be revealed. This paper describes the dev

  10. Effects of speech clarity on recognition memory for spoken sentences.

    Directory of Open Access Journals (Sweden)

    Kristin J Van Engen

    Full Text Available Extensive research shows that inter-talker variability (i.e., changing the talker) affects recognition memory for speech signals. However, relatively little is known about the consequences of intra-talker variability (i.e., changes in speaking style within a talker) on the encoding of speech signals in memory. It is well established that speakers can modulate the characteristics of their own speech and produce a listener-oriented, intelligibility-enhancing speaking style in response to communication demands (e.g., when speaking to listeners with hearing impairment or non-native speakers of the language). Here we conducted two experiments to examine the role of speaking style variation in spoken language processing. First, we examined the extent to which clear speech provided benefits in challenging listening environments (i.e., speech-in-noise). Second, we compared recognition memory for sentences produced in conversational and clear speaking styles. In both experiments, semantically normal and anomalous sentences were included to investigate the role of higher-level linguistic information in the processing of speaking style variability. The results show that acoustic-phonetic modifications implemented in listener-oriented speech lead to improved speech recognition in challenging listening conditions and, crucially, to a substantial enhancement in recognition memory for sentences.

  11. Mandarin Digits Speech Recognition Using Support Vector Machines

    Institute of Scientific and Technical Information of China (English)

    XIE Xiang; KUANG Jing-ming

    2005-01-01

    A method of applying support vector machines (SVM) in speech recognition was proposed, and a speech recognition system for Mandarin digits was built with SVMs. In the system, vectors were linearly extracted from the speech feature sequence to make up time-aligned input patterns for the SVM, and the decisions of several 2-class SVM classifiers were combined to construct an N-class classifier. Four kinds of SVM kernel functions were compared in speaker-independent recognition experiments on Mandarin digits. The radial basis function kernel achieved the highest accuracy rate of 99.33%, better than that of the baseline system based on hidden Markov models (HMM) (97.08%). The experiments also show that SVM can outperform HMM, especially when the samples available for learning are very limited.
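
    A hedged sketch of such a system with scikit-learn, whose SVC builds the N-class decision from pairwise 2-class SVMs as described above; the fixed-length resampling stands in for the paper's (unspecified) linear extraction of time-aligned input patterns:

```python
import numpy as np
from sklearn.svm import SVC

def time_aligned_pattern(feats, n_frames=20):
    """Linearly resample a (T x D) feature sequence to a fixed number of
    frames and flatten it into one SVM input vector."""
    idx = np.linspace(0, len(feats) - 1, n_frames).astype(int)
    return feats[idx].reshape(-1)

# RBF kernel, as reported best above; C and gamma are illustrative values.
clf = SVC(kernel='rbf', C=10.0, gamma='scale')
# X = np.stack([time_aligned_pattern(f) for f in train_feats])
# clf.fit(X, digit_labels)
```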

  12. Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion

    CERN Document Server

    Bassil, Youssef

    2012-01-01

    ASR, short for Automatic Speech Recognition, is the process of converting spoken speech into text that can be manipulated by a computer. Although ASR has several applications, it is still erroneous and imprecise, especially when used in a harsh environment where the input speech is of low quality. This paper proposes a post-editing ASR error correction method and algorithm based on Bing's online spelling suggestion. In this approach, the ASR-recognized output text is spell-checked using Bing's spelling suggestion technology to detect and correct misrecognized words. More specifically, the proposed algorithm breaks down the ASR output text into several word tokens that are submitted as search queries to the Bing search engine. A returned spelling suggestion implies that a query is misspelled, and thus it is replaced by the suggested correction; otherwise, no correction is performed and the algorithm continues with the next token until all tokens get validated. Experiments carried out on various speeches in differen...
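
    The token-by-token loop at the core of the algorithm can be sketched as below; `spelling_suggestion` is a hypothetical callback standing in for the Bing query step, whose exact interface is not described here:

```python
def post_edit(asr_output, spelling_suggestion):
    """Replace every ASR token for which the spelling_suggestion callback
    returns a correction; tokens with no suggestion are kept unchanged."""
    corrected = []
    for token in asr_output.split():
        suggestion = spelling_suggestion(token)   # None if the token looks correct
        corrected.append(suggestion if suggestion else token)
    return ' '.join(corrected)
```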

  13. Automatic transcription of continuous speech into syllable-like units for Indian languages

    Indian Academy of Sciences (India)

    G Lakshmi Sarada; A Lakshmi; Hema A Murthy; T Nagarajan

    2009-04-01

    The focus of this paper is to automatically segment and label a continuous speech signal into syllable-like units for Indian languages. In this approach, the continuous speech signal is first automatically segmented into syllable-like units using a group delay based algorithm. Similar syllable segments are then grouped together using an unsupervised and incremental training (UIT) technique. Isolated-style HMM models are generated for each of the clusters during training. During testing, the speech signal is segmented into syllable-like units which are then tested against the HMMs obtained during training. This results in a syllable recognition performance of 42·6% and 39·94% for Tamil and Telugu, respectively. A new feature extraction technique that uses features extracted from multiple frame sizes and frame rates during both training and testing is explored for the syllable recognition task. This results in a recognition performance of 48·7% and 45·36%, for Tamil and Telugu respectively. The performance of segmentation followed by labelling is superior to that of a flat start syllable recogniser (27·8% and 28·8% for Tamil and Telugu respectively).

  14. Microphone Array Speech Recognition : Experiments on Overlapping Speech in Meetings

    OpenAIRE

    Moore, Darren; McCowan, Iain A.

    2002-01-01

    This paper investigates the use of microphone arrays to acquire and recognise speech in meetings. Meetings pose several interesting problems for speech processing, as they consist of multiple competing speakers within a small space, typically around a table. Due to their ability to provide hands-free acquisition and directional discrimination, microphone arrays present a potential alternative to close-talking microphones in such an application. We first propose an appropriate microphone array...

  15. Explanation mode for Bayesian automatic object recognition

    Science.gov (United States)

    Hazlett, Thomas L.; Cofer, Rufus H.; Brown, Harold K.

    1992-09-01

    One of the more useful techniques to emerge from AI is the provision of an explanation modality used by the researcher to understand and subsequently tune the reasoning of an expert system. Such a capability, missing in the arena of statistical object recognition, is not that difficult to provide. Long-standing results show that the paradigm of Bayesian object recognition is truly optimal in a minimum probability of error sense. To a large degree, the Bayesian paradigm achieves optimality through adroit fusion of a wide range of lower-information data sources to give a higher quality decision--a very 'expert system'-like capability. When the various sources of incoming data are represented by C++ classes, it becomes possible to automatically backtrack the Bayesian data fusion process, assigning relative weights to the more significant datums and their combinations. A C++ object-oriented engine is then able to synthesize 'English'-like textual descriptions of the Bayesian reasoning suitable for generalized presentation. Key concepts and examples are provided based on an actual object recognition problem.

  16. A Multi-Modal Recognition System Using Face and Speech

    OpenAIRE

    Samir Akrouf; Belayadi Yahia; Mostefai Messaoud; Youssef Chahir

    2011-01-01

    Nowadays person recognition has received more and more interest, especially for security reasons. The recognition performed by a biometric system using a single modality tends to be less reliable due to sensor data, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of reco...

  17. Towards automatic musical instrument timbre recognition

    Science.gov (United States)

    Park, Tae Hong

    This dissertation comprises two parts, focusing on issues concerning research and development of an artificial system for automatic musical instrument timbre recognition, and on musical compositions. The technical part of the essay includes a detailed record of developed and implemented algorithms for feature extraction and pattern recognition. A review of existing literature introducing historical aspects surrounding timbre research, problems associated with a number of timbre definitions, and highlights of selected research activities that have had significant impact in this field is also included. The developed timbre recognition system follows a bottom-up, data-driven model that includes a pre-processing module, a feature extraction module, and an RBF/EBF (Radial/Elliptical Basis Function) neural network-based pattern recognition module. 829 monophonic samples from 12 instruments were chosen from the Peter Siedlaczek library (Best Service) and other samples from the Internet and personal collections. Significant emphasis has been put on feature extraction development and testing to achieve robust and consistent feature vectors that are eventually passed to the neural network module. In order to avoid a garbage-in-garbage-out (GIGO) trap and improve generality, extra care was taken in designing and testing the developed algorithms using various dynamics, different playing techniques, and a variety of pitches for each instrument, with inclusion of attack and steady-state portions of a signal. Most of the research and development was conducted in Matlab. The compositional part of the essay includes brief introductions to "A d'Ess Are," "Aboji," "48 13 N, 16 20 O," and "pH-SQ." A general outline pertaining to the ideas and concepts behind the architectural designs of the pieces, including formal structures, time structures, orchestration methods, and pitch structures, is also presented.

  18. ISOLATED SPEECH RECOGNITION SYSTEM FOR TAMIL LANGUAGE USING STATISTICAL PATTERN MATCHING AND MACHINE LEARNING TECHNIQUES

    Directory of Open Access Journals (Sweden)

    VIMALA C.

    2015-05-01

    Full Text Available In recent years, speech technology has become a vital part of our daily lives. Various techniques have been proposed for developing Automatic Speech Recognition (ASR) systems and have achieved great success in many applications. Among them, template matching techniques like Dynamic Time Warping (DTW), statistical pattern matching techniques such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), and machine learning techniques such as Neural Networks (NN), Support Vector Machines (SVM), and Decision Trees (DT) are most popular. The main objective of this paper is to design and develop a speaker-independent isolated speech recognition system for the Tamil language using the above speech recognition techniques. The background of ASR systems, the steps involved in ASR, the merits and demerits of the conventional and machine learning algorithms, and the observations made based on the experiments are presented in this paper. For the developed system, the highest word recognition accuracy is achieved with the HMM technique: 100% accuracy during the training process and 97.92% during testing.
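
    Of the techniques listed, DTW is the easiest to make concrete. A minimal numpy sketch of the DTW distance between two utterances' feature sequences; isolated-word recognition then picks the template word with the smallest distance:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between sequences a (Ta x D) and
    b (Tb x D), with Euclidean local cost and the standard three moves."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[Ta, Tb]
```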

  19. Robust automatic target recognition in FLIR imagery

    Science.gov (United States)

    Soyman, Yusuf

    2012-05-01

    In this paper, a robust automatic target recognition algorithm for FLIR imagery is proposed. The target is first segmented out from the background using the wavelet transform; the segmentation process is accomplished by a parametric Gabor wavelet transformation. Invariant features belonging to the segmented target are then extracted via moments. Higher-order moments, while providing better quality for identifying the image, are more sensitive to noise. A trade-off study is then performed on a few moments that provide effective performance. A Bayes method is used for classification, with the Mahalanobis distance as the classifier metric. Results are assessed based on false alarm rates. The proposed method is shown to be robust against rotations, translations and scale effects. Moreover, it is shown to perform effectively on low-contrast objects in FLIR images. Performance comparisons are also performed on both GPU and CPU. Results indicate that GPU has superior performance over CPU.

  20. Recognition of Emotions in Mexican Spanish Speech: An Approach Based on Acoustic Modelling of Emotion-Specific Vowels

    Directory of Open Access Journals (Sweden)

    Santiago-Omar Caballero-Morales

    2013-01-01

    Full Text Available An approach for the recognition of emotions in speech is presented. The target language is Mexican Spanish, and for this purpose a speech database was created. The approach consists in the phoneme acoustic modelling of emotion-specific vowels. For this, a standard phoneme-based Automatic Speech Recognition (ASR system was built with Hidden Markov Models (HMMs, where different phoneme HMMs were built for the consonants and emotion-specific vowels associated with four emotional states (anger, happiness, neutral, sadness. Then, estimation of the emotional state from a spoken sentence is performed by counting the number of emotion-specific vowels found in the ASR’s output for the sentence. With this approach, accuracy of 87–100% was achieved for the recognition of emotional state of Mexican Spanish speech.
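
    The counting step is simple to sketch; the vowel-emotion suffix convention below is a hypothetical naming scheme for the emotion-specific vowel models, not the paper's actual label set:

```python
from collections import Counter

EMOTIONS = ('anger', 'happiness', 'neutral', 'sadness')

def estimate_emotion(phoneme_labels):
    """Pick the emotion whose emotion-specific vowels (labels assumed to
    look like 'a_anger', 'e_sadness', ...) occur most often in the ASR
    output for the sentence."""
    counts = Counter(lab.split('_')[1] for lab in phoneme_labels
                     if '_' in lab and lab.split('_')[1] in EMOTIONS)
    return counts.most_common(1)[0][0] if counts else 'neutral'
```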

  1. A Multi-Modal Recognition System Using Face and Speech

    Directory of Open Access Journals (Sweden)

    Samir Akrouf

    2011-05-01

    Full Text Available Nowadays person recognition has received more and more interest, especially for security reasons. The recognition performed by a biometric system using a single modality tends to be less reliable due to sensor data, restricted degrees of freedom and unacceptable error rates. To alleviate some of these problems we use multimodal biometric systems, which provide better recognition results. By combining different modalities, such as speech, face, fingerprint, etc., we increase the performance of recognition systems. In this paper, we study the fusion of speech and face in a recognition system for taking a final decision (i.e., accept or reject an identity claim). We evaluate the performance of each system separately, then we fuse the results and compare the performances.
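
    A minimal sketch of score-level fusion for the final accept/reject decision, assuming each unimodal system outputs a match score normalized to [0, 1]; the weight and threshold are illustrative values, not the paper's:

```python
def fuse_and_decide(face_score, speech_score, w_face=0.5, threshold=0.5):
    """Weighted-sum fusion of two normalized match scores."""
    fused = w_face * face_score + (1.0 - w_face) * speech_score
    return 'accept' if fused >= threshold else 'reject'
```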

  2. Speech recognition: Acoustic phonetic and lexical knowledge representation

    Science.gov (United States)

    Zue, V. W.

    1984-02-01

    The purpose of this program is to develop a speech database facility under which the acoustic characteristics of speech sounds in various contexts can be studied conveniently; investigate the phonological properties of a large lexicon of, say, 10,000 words and determine to what extent the phonotactic constraints can be utilized in speech recognition; study the acoustic cues that are used to mark word boundaries; develop a test bed in the form of a large-vocabulary IWR system to study the interactions of acoustic, phonetic and lexical knowledge; and develop a limited continuous speech recognition system with the goal of recognizing any English word from its spelling, in order to assess the interactions of higher-level knowledge sources.

  3. Image simulation for automatic license plate recognition

    Science.gov (United States)

    Bala, Raja; Zhao, Yonghui; Burry, Aaron; Kozitsky, Vladimir; Fillion, Claude; Saunders, Craig; Rodríguez-Serrano, José

    2012-01-01

    Automatic license plate recognition (ALPR) is an important capability for traffic surveillance applications, including toll monitoring and detection of different types of traffic violations. ALPR is a multi-stage process comprising plate localization, character segmentation, optical character recognition (OCR), and identification of originating jurisdiction (i.e. state or province). Training of an ALPR system for a new jurisdiction typically involves gathering vast amounts of license plate images and associated ground truth data, followed by iterative tuning and optimization of the ALPR algorithms. The substantial time and effort required to train and optimize the ALPR system can result in excessive operational cost and overhead. In this paper we propose a framework to create an artificial set of license plate images for accelerated training and optimization of ALPR algorithms. The framework comprises two steps: the synthesis of license plate images according to the design and layout for a jurisdiction of interest; and the modeling of imaging transformations and distortions typically encountered in the image capture process. Distortion parameters are estimated by measurements of real plate images. The simulation methodology is successfully demonstrated for training of OCR.

  4. Multimodal Approach for Automatic Emotion Recognition Applied to the Tension Levels Study in TV Newscasts

    OpenAIRE

    Moisés Henrique Ramos Pereira; Flávio Luis Cardeal Pádua; Giani David Silva

    2015-01-01

    This article addresses a multimodal approach to automatic emotion recognition in participants of TV newscasts (presenters, reporters, commentators and others) able to assist the tension levels study in narratives of events in this television genre. The methodology applies state-of-the-art computational methods to process and analyze facial expressions, as well as speech signals. The proposed approach contributes to semiodiscoursive study of TV newscasts and their enunciative praxis, assisting...

  5. Speech Recognition Method Based on Multilayer Chaotic Neural Network

    Institute of Scientific and Technical Information of China (English)

    REN Xiaolin; HU Guangrui

    2001-01-01

    In this paper, speech recognition using neural networks is investigated. In particular, chaotic dynamics is introduced into the neurons, and a multilayer chaotic neural network (MLCNN) architecture is built. A learning algorithm is also derived to train the weights of the network. We apply the MLCNN to speech recognition and compare the performance of the network with those of the recurrent neural network (RNN) and the time-delay neural network (TDNN). Experimental results show that the MLCNN method outperforms the other neural network methods with respect to average recognition rate.

  6. Integration of Metamodel and Acoustic Model for Dysarthric Speech Recognition

    Directory of Open Access Journals (Sweden)

    Hironori Matsumasa

    2009-08-01

    Full Text Available We investigated the speech recognition of a person with articulation disorders resulting from athetoid cerebral palsy. The articulation of the first words spoken tends to be unstable due to the strain placed on the speech-related muscles, and this causes degradation of speech recognition. Therefore, we proposed a robust feature extraction method based on PCA (Principal Component Analysis) instead of MFCC, where the main stable utterance element is projected onto low-order features and fluctuation elements of speaking style are projected onto high-order features. The PCA-based filter is thus able to extract the stable utterance features only. The fluctuation of speaking style may invoke phone fluctuations, such as substitutions, deletions and insertions. In this paper, we discuss our effort to integrate a metamodel and an acoustic model approach. Metamodels offer a technique for incorporating a model of a speaker's confusion matrix into the ASR process in such a way as to increase recognition accuracy. The integration of metamodels and acoustic models enables fluctuation suppression not only in feature extraction but also in recognition. The proposed method resulted in an improvement of 9.9% (from 79.1% to 89.0%) in the recognition rate compared to the conventional method.
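
    A hedged sketch of the low-order projection idea with scikit-learn; in the actual system the projection would be learned beforehand and applied as a fixed filter, and the number of retained components is our assumption:

```python
from sklearn.decomposition import PCA

def stable_utterance_features(frames, n_low=12):
    """Project spectral frames (T x D) onto the first principal
    components, keeping the stable utterance element and discarding the
    high-order components carrying speaking-style fluctuation."""
    return PCA(n_components=n_low).fit_transform(frames)
```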

  7. Integrating HMM-Based Speech Recognition With Direct Manipulation In A Multimodal Korean Natural Language Interface

    CERN Document Server

    Lee, G; Kim, S; Lee, Geunbae; Lee, Jong-Hyeok; Kim, Sangeok

    1996-01-01

    This paper presents an HMM-based speech recognition engine and its integration into direct manipulation interfaces for a Korean document editor. Speech recognition can reduce the tedious and repetitive actions that are inevitable in standard GUIs (graphical user interfaces). Our system consists of a general speech recognition engine called ABrain (Auditory Brain) and a speech-commandable document editor called SHE (Simple Hearing Editor). ABrain is a phoneme-based speech recognition engine which achieves up to 97% discrete command recognition rate. SHE is a EuroBridge widget-based document editor that supports speech commands as well as direct manipulation interfaces.

  8. Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders

    CERN Document Server

    Baghai-Ravary, Ladan

    2013-01-01

    Automatic Speech Signal Analysis for Clinical Diagnosis and Assessment of Speech Disorders provides a survey of methods designed to aid clinicians in the diagnosis and monitoring of speech disorders such as dysarthria and dyspraxia, with an emphasis on the signal processing techniques, statistical validity of the results presented in the literature, and the appropriateness of methods that do not require specialized equipment, rigorously controlled recording procedures or highly skilled personnel to interpret results. Such techniques offer the promise of a simple and cost-effective, yet objective, assessment of a range of medical conditions, which would be of great value to clinicians. The ideal scenario would begin with the collection of examples of the clients’ speech, either over the phone or using portable recording devices operated by non-specialist nursing staff. The recordings could then be analyzed initially to aid diagnosis of conditions, and subsequently to monitor the clients’ progress and res...

  9. Automatic audiovisual integration in speech perception.

    Science.gov (United States)

    Gentilucci, Maurizio; Cattaneo, Luigi

    2005-11-01

    Two experiments aimed to determine whether features of both the visual and acoustical inputs are always merged into the perceived representation of speech and whether this audiovisual integration is based on either cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment participants read the same printed strings of letters. This condition aimed to analyze the pattern of voice and the lip kinematics controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the articulation mouth gestures were congruent with the emission of the string of phones, the voice spectrum and the lip kinematics varied according to the pronounced strings of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between visual and acoustical stimuli. The acoustical analysis of the participants' spoken responses showed three distinct patterns: the fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, the analysis of the latter two responses showed that the formant 2 of the participants' voice spectra always differed from the value recorded in the congruent audiovisual presentation. It approached the value of the formant 2 of the string of phonemes presented in the other modality, which was apparently ignored. The lip kinematics of the participants repeating the string of phonemes acoustically presented were influenced by the observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant. The data are discussed in favor of the hypothesis that features of both

  10. Speech recognition: Acoustic-phonetic knowledge acquisition and representation

    Science.gov (United States)

    Zue, Victor W.

    1988-09-01

    The long-term research goal is to develop and implement speaker-independent continuous speech recognition systems. It is believed that the proper utilization of speech-specific knowledge is essential for such advanced systems. This research is thus directed toward the acquisition, quantification, and representation of acoustic-phonetic and lexical knowledge, and the application of this knowledge to speech recognition algorithms. In addition, we are exploring new speech recognition alternatives based on artificial intelligence and connectionist techniques. We developed a statistical model for predicting the acoustic realization of stop consonants in various positions in the syllable template. A unification-based grammatical formalism was developed for incorporating this model into the lexical access algorithm. We provided an information-theoretic justification for the hierarchical structure of the syllable template. We analyzed segment durations for vowels and fricatives in continuous speech. Based on contextual information, we developed durational models for vowels and fricatives that account for over 70 percent of the variance, using data from multiple, unknown speakers. We rigorously evaluated the ability of human spectrogram readers to identify stop consonants spoken by many talkers and in a variety of phonetic contexts. Incorporating the declarative knowledge used by the readers, we developed a knowledge-based system for stop identification. We achieved system performance comparable to that of the readers.

  11. Improving user-friendliness by using visually supported speech recognition

    NARCIS (Netherlands)

    Waals, J.A.J.S.; Kooi, F.L.; Kriekaard, J.J.

    2002-01-01

    While speech recognition in principle may be one of the most natural interfaces, in practice it is not due to the lack of user-friendliness. Words are regularly interpreted wrong, and subjects tend to articulate in an exaggerated manner. We explored the potential of visually supported error correcti

  12. Speech emotion recognition based on statistical pitch model

    Institute of Scientific and Technical Information of China (English)

    WANG Zhiping; ZHAO Li; ZOU Cairong

    2006-01-01

    A modified Parzen-window method, which keeps high resolution at low frequencies and smoothness at high frequencies, is proposed to obtain the statistical model. Then, a gender classification method utilizing the statistical model is proposed, which achieves 98% gender classification accuracy when long sentences are processed. After separating the male and female voices, the means and standard deviations of the speech training samples with different emotions are used to create the corresponding emotion models. Then the Bhattacharyya distance between the test sample and the statistical pitch models is utilized for emotion recognition in speech. The normalization of pitch for male and female voices is also considered, in order to map them into a uniform space. Finally, a speech emotion recognition experiment based on K Nearest Neighbor shows that a correct rate of 81% is achieved, whereas it is only 73.85% if the traditional parameters are utilized.
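
    For reference, a minimal numpy sketch of the Bhattacharyya distance between two Gaussian models, as used here to compare a test sample's pitch statistics against each emotion model:

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2)
           + 1/2 ln( |S| / sqrt(|S1| |S2|) ),  with S = (S1 + S2) / 2."""
    S = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(S, diff)
    logdet = lambda M: np.linalg.slogdet(M)[1]
    term2 = 0.5 * (logdet(S) - 0.5 * (logdet(cov1) + logdet(cov2)))
    return term1 + term2
```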

  13. EMOTIONAL SPEECH RECOGNITION BASED ON SVM WITH GMM SUPERVECTOR

    Institute of Scientific and Technical Information of China (English)

    Chen Yanxiang; Xie Jian

    2012-01-01

    Emotion recognition from speech is an important field of research in human-computer interaction. In this letter the framework of Support Vector Machines (SVM) with a Gaussian Mixture Model (GMM) supervector is introduced for emotional speech recognition. Because of the importance of variance in reflecting the distribution of speech, the normalized mean vectors, which have the potential to exploit the information from the variance, are adopted to form the GMM supervector. Comparative experiments from five aspects are conducted to study their corresponding effects on system performance. The experimental results, which indicate that the influence of the number of mixtures is strong while the influence of duration is weak, provide a basis for the training set selection of the Universal Background Model (UBM).
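
    A sketch of forming a GMM supervector from a scikit-learn UBM via relevance-MAP mean adaptation; the diagonal covariances, the relevance factor and the scaling by the UBM standard deviations (one reading of the "normalized mean vectors" above) are all assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(feats, ubm: GaussianMixture, relevance=16.0):
    """Stack relevance-MAP-adapted mixture means into one supervector.
    The UBM is assumed fitted with covariance_type='diag'."""
    post = ubm.predict_proba(feats)            # (T, M) responsibilities
    n = post.sum(axis=0) + 1e-10               # soft counts per mixture
    f = post.T @ feats                         # (M, D) first-order statistics
    alpha = (n / (n + relevance))[:, None]
    adapted = alpha * (f / n[:, None]) + (1 - alpha) * ubm.means_
    std = np.sqrt(ubm.covariances_)            # (M, D) for diagonal covariances
    return ((adapted - ubm.means_) / std).reshape(-1)
```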

  14. Automatic target recognition apparatus and method

    Energy Technology Data Exchange (ETDEWEB)

    Baumgart, Chris W. (Santa Fe, NM); Ciarcia, Christopher A. (Los Alamos, NM)

    2000-01-01

    An automatic target recognition apparatus (10) is provided, having a video camera/digitizer (12) for producing a digitized image signal (20) representing an image containing therein objects which objects are to be recognized if they meet predefined criteria. The digitized image signal (20) is processed within a video analysis subroutine (22) residing in a computer (14) in a plurality of parallel analysis chains such that the objects are presumed to be lighter in shading than the background in the image in three of the chains and further such that the objects are presumed to be darker than the background in the other three chains. In two of the chains the objects are defined by surface texture analysis using texture filter operations. In another two of the chains the objects are defined by background subtraction operations. In yet another two of the chains the objects are defined by edge enhancement processes. In each of the analysis chains a calculation operation independently determines an error factor relating to the probability that the objects are of the type which should be recognized, and a probability calculation operation combines the results of the analysis chains.

  15. POLISH EMOTIONAL SPEECH RECOGNITION USING ARTIFICAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    Paweł Powroźnik

    2014-11-01

    Full Text Available The article presents the issue of emotion recognition based on Polish emotional speech analysis. The Polish database of emotional speech, prepared and shared by the Medical Electronics Division of the Lodz University of Technology, has been used for the research. The following parameters extracted from the sampled and normalised speech signal have been used for the analysis: energy of the signal, speaker's sex, average value of the speech signal, and both the minimum and maximum sample value for a given signal. As the emotional state classifier, a four-layer artificial neural network has been used. The achieved results reach 50% accuracy. The conducted research focused on six emotional states: a neutral state, sadness, joy, anger, fear and boredom.

  16. Temporal visual cues aid speech recognition

    DEFF Research Database (Denmark)

    Zhou, Xiang; Ross, Lars; Lehn-Schiøler, Tue;

    2006-01-01

    BACKGROUND: It is well known that under noisy conditions, viewing a speaker's articulatory movement aids the recognition of spoken words. Conventionally it is thought that the visual input disambiguates otherwise confusing auditory input. HYPOTHESIS: In contrast we hypothesize that it is the temporal synchronicity of the visual input that aids parsing of the auditory stream. More specifically, we expected that purely temporal information, which does not convey information such as place of articulation, may facilitate word recognition. METHODS: To test this prediction we used temporal features of audio to generate an artificial talking-face video and measured word recognition performance on simple monosyllabic words. RESULTS: When presenting words together with the artificial video we find that word recognition is improved over purely auditory presentation. The effect is significant (p...

  17. Biologically inspired emotion recognition from speech

    OpenAIRE

    Buscicchio Cosimo; Caponetti Laura; Castellano Giovanna

    2011-01-01

    Abstract Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency ceps...

  18. Computer recognition of phonets in speech

    Science.gov (United States)

    Martin, D. D.

    1982-12-01

    This project generated phonetic units, termed 'phonets,' from digitized speech files. The time file was converted to feature space using the Fourier Transform, and phonet occurrences were detected using Minkowski One and Two distance measures. Phonet matches were detected and ranked for each phonet compared against a template file. Phonet short-time energy was included in the output files. An algorithm was developed to partition feature space and its performance was evaluated.

  19. Phonetic recognition of natural speech by nonstationary Markov models

    Science.gov (United States)

    Falaschi, Alessandro

    1988-04-01

    A speech recognition system based on statistical decision theory, viewing the problem as the classical design of a decoder in a communication system framework is outlined. Statistical properties of the language are used to characterize the allowable phonetic sequence inside the words, while trying to capture allophonic phoneme features into functional-dependent acoustical models with the aim of utilizing them as word segmentation cues. Experiments prove the utility of an explicit modeling of the intrinsic speech nonstationarity in a statistically based speech recognition system. The nonstationarity of phonetic chain statistics and acoustical transition probabilities can be easily taken into account, yielding recognition improvements. The use of inside syllable position dependent phonetic models does not improve recognition performance, and the iterative Viterbi training algorithm seems unable to adequately valorize this kind of acoustical modeling. As a direct consequence of the system design, the recognized phonetic sequence exhibits word boundary marks even in absence of pauses between words, thus giving anchor points to the higher level parsing algorithms needed in a complete recognition system.

  20. Environment-dependent denoising autoencoder for distant-talking speech recognition

    Science.gov (United States)

    Ueda, Yuma; Wang, Longbiao; Kai, Atsuhiko; Ren, Bo

    2015-12-01

    In this paper, we propose an environment-dependent denoising autoencoder (DAE) and automatic environment identification based on a deep neural network (DNN) with blind reverberation estimation for robust distant-talking speech recognition. Recently, DAEs have been shown to be effective in many noise reduction and reverberation suppression applications because higher-level representations and increased flexibility of the feature mapping function can be learned. However, a DAE is not adequate under mismatched training and test environments. In a conventional DAE, parameters are trained using pairs of reverberant speech and clean speech under various acoustic conditions (that is, an environment-independent DAE). To address the above problem, we propose two environment-dependent DAEs to reduce the influence of mismatches between training and test environments. In the first approach, we train various DAEs using speech from different acoustic environments, and the DAE for the condition that best matches the test condition is automatically selected (that is, a two-step environment-dependent DAE). To improve environment identification performance, we propose a DNN that uses both reverberant speech and estimated reverberation. In the second approach, we add estimated reverberation features to the input of the DAE (that is, a one-step environment-dependent DAE or a reverberation-aware DAE). The proposed method is evaluated using speech in simulated and real reverberant environments. Experimental results show that the environment-dependent DAE outperforms the environment-independent one in both simulated and real reverberant environments. For the two-step environment-dependent DAE, the performance of environment identification based on the proposed DNN approach is also better than that of the conventional DNN approach, in which only reverberant speech is used and reverberation is not blindly estimated. Moreover, the one-step environment-dependent DAE significantly outperforms the two-step one.
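
    A minimal PyTorch sketch of the feature-domain DAE at the core of these systems, trained on parallel reverberant/clean pairs; the context window, layer sizes and activations are illustrative, and the environment-dependent variants would either select among several such models or append estimated reverberation features to the input:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps a window of reverberant log-spectral frames to one clean frame."""

    def __init__(self, n_context=11, n_bins=40, n_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_context * n_bins, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bins),
        )

    def forward(self, x):          # x: (batch, n_context * n_bins)
        return self.net(x)

# Training sketch on parallel data:
# model = DenoisingAutoencoder()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = nn.functional.mse_loss(model(reverb_batch), clean_batch)
# opt.zero_grad(); loss.backward(); opt.step()
```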

  1. Adaptive Recognition of Phonemes from Speaker - Connected-Speech Using Alisa.

    Science.gov (United States)

    Osella, Stephen Albert

    The purpose of this dissertation research is to investigate a novel approach to automatic speech recognition (ASR). The successes that have been achieved in ASR have relied heavily on the use of a language grammar, which significantly constrains the ASR process. By using grammar to provide most of the recognition ability, the ASR system does not have to be as accurate at the low-level recognition stage. The ALISA Phonetic Transcriber (APT) algorithm is proposed as a way to improve ASR by enhancing the lowest -level recognition stage. The objective of the APT algorithm is to classify speech frames (a short sequence of speech signal samples) into a small set of phoneme classes. The APT algorithm constructs the mapping from speech frames to phoneme labels through a multi-layer feedforward process. A design principle of APT is that final decisions are delayed as long as possible. Instead of attempting to optimize the decision making at each processing level individually, each level generates a list of candidate solutions that are passed on to the next level of processing. The later processing levels use these candidate solutions to resolve ambiguities. The scope of this dissertation is the design of the APT algorithm up to the speech-frame classification stage. In future research, the APT algorithm will be extended to the word recognition stage. In particular, the APT algorithm could serve as the front-end stage to a Hidden Markov Model (HMM) based word recognition system. In such a configuration, the APT algorithm would provide the HMM with the requisite phoneme state-probability estimates. To date, the APT algorithm has been tested with the TIMIT and NTIMIT speech databases. The APT algorithm has been trained and tested on the SX and SI sentence texts using both male and female speakers. Results indicate better performance than those results obtained using a neural network based speech-frame classifier. The performance of the APT algorithm has been evaluated for

  2. New Ideas for Speech Recognition and Related Technologies

    Energy Technology Data Exchange (ETDEWEB)

    Holzrichter, J F

    2002-06-17

    The ideas relating to the use of organ motion sensors for the purposes of speech recognition were first described by the author in spring 1994. During the past year, a series of productive collaborations between the author, Tom McEwan and Larry Ng ensued and have led to demonstrations, new sensor ideas, and algorithmic descriptions of a large number of speech recognition concepts. This document summarizes the basic concepts of recognizing speech once organ motions have been obtained. Micro-power radars and their uses for the measurement of body organ motions, such as those of the heart and lungs, have been demonstrated by Tom McEwan over the past two years. McEwan and I conducted a series of experiments, using these instruments, on vocal organ motions beginning in late spring, during which we observed motions of vocal folds (i.e., cords), tongue, jaw, and related organs that are very useful for speech recognition and other purposes. These will be reviewed in a separate paper. Since late summer 1994, Lawrence Ng and I have worked to make many of the initial recognition ideas more rigorous and to investigate the applications of these new ideas to new speech recognition algorithms, to speech coding, and to speech synthesis. I introduce some of those ideas in section IV of this document, and we describe them more completely in the document following this one, UCRL-UR-120311. For the design and operation of micro-power radars and their application to body organ motions, the reader may contact Tom McEwan directly. The capability of using EM sensors (i.e., radar units) to measure body organ motions and positions has been available for decades. Impediments to their use appear to have been size, excessive power, lack of resolution, and lack of understanding of the value of organ motion measurements, especially as applied to speech related technologies. However, with the invention of very low power, portable systems as demonstrated by McEwan at LLNL, researchers have begun

  3. Biologically inspired emotion recognition from speech

    Directory of Open Access Journals (Sweden)

    Buscicchio Cosimo

    2011-01-01

    Full Text Available Abstract Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.

  4. Biologically inspired emotion recognition from speech

    Science.gov (United States)

    Caponetti, Laura; Buscicchio, Cosimo Alessandro; Castellano, Giovanna

    2011-12-01

    Emotion recognition has become a fundamental task in human-computer interaction systems. In this article, we propose an emotion recognition approach based on biologically inspired methods. Specifically, emotion classification is performed using a long short-term memory (LSTM) recurrent neural network which is able to recognize long-range dependencies between successive temporal patterns. We propose to represent data using features derived from two different models: mel-frequency cepstral coefficients (MFCC) and the Lyon cochlear model. In the experimental phase, results obtained from the LSTM network and the two different feature sets are compared, showing that features derived from the Lyon cochlear model give better recognition results in comparison with those obtained with the traditional MFCC representation.
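
    A hedged PyTorch sketch of the classifier described above: an LSTM consumes frame-level features (MFCCs or Lyon cochlear-model outputs) and its final hidden state is mapped to emotion logits; all sizes are assumptions:

```python
import torch
import torch.nn as nn

class LSTMEmotionClassifier(nn.Module):
    def __init__(self, n_features=13, n_hidden=64, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_emotions)

    def forward(self, x):           # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)    # h: (1, batch, n_hidden), final time step
        return self.out(h[-1])      # (batch, n_emotions) emotion logits
```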

  5. Automatic anatomy recognition of sparse objects

    Science.gov (United States)

    Zhao, Liming; Udupa, Jayaram K.; Odhner, Dewey; Wang, Huiqian; Tong, Yubing; Torigian, Drew A.

    2015-03-01

    A general body-wide automatic anatomy recognition (AAR) methodology was proposed in our previous work based on hierarchical fuzzy models of multitudes of objects which was not tied to any specific organ system, body region, or image modality. That work revealed the challenges encountered in modeling, recognizing, and delineating sparse objects throughout the body (compared to their non-sparse counterparts) if the models are based on the object's exact geometric representations. The challenges stem mainly from the variation in sparse objects in their shape, topology, geographic layout, and relationship to other objects. That led to the idea of modeling sparse objects not from the precise geometric representations of their samples but by using a properly designed optimal super form. This paper presents the underlying improved methodology which includes 5 steps: (a) Collecting image data from a specific population group G and body region Β and delineating in these images the objects in Β to be modeled; (b) Building a super form, S-form, for each object O in Β; (c) Refining the S-form of O to construct an optimal (minimal) super form, S*-form, which constitutes the (fuzzy) model of O; (d) Recognizing objects in Β using the S*-form; (e) Defining confounding and background objects in each S*-form for each object and performing optimal delineation. Our evaluations based on 50 3D computed tomography (CT) image sets in the thorax on four sparse objects indicate that substantially improved performance (FPVF~2%, FNVF~10%, and success where the previous approach failed) can be achieved using the new approach.

  6. Speed and automaticity of word recognition - inseparable twins?

    DEFF Research Database (Denmark)

    Poulsen, Mads; Asmussen, Vibeke; Elbro, Carsten

    'Speed and automaticity' of word recognition is a standard collocation. However, it is not clear whether speed and automaticity (i.e., effortlessness) make independent contributions to reading comprehension. In theory, both speed and automaticity may save cognitive resources for comprehension processes. Hence, the aim of the present study was to assess the unique contributions of word recognition speed and automaticity to reading comprehension while controlling for decoding speed and accuracy. Method: 139 Grade 5 students completed tests of reading comprehension and computer-based tests of speed... shared developmental sources. However, multiple regression analyses indicated that both automaticity (effortlessness) and speed of word recognition (word-specific orthographic knowledge) contributed unique variance to reading comprehension when word decoding accuracy and speed were controlled. Conclusion...

  7. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    OpenAIRE

    Veton Z. Këpuska; Mohamed M. Eljhani; Brian H. Hight

    2014-01-01

    A typical speech recognition system is push-to-talk operated and requires activation. However, for those engaged in hands-busy applications, movement may be restricted or impossible. One alternative is to use a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes a speech-only interface. A WUW-SR system would allow the user to activate systems (cell phone, computer, etc.) with only speech commands instead of manual activation. T...

  8. EMOTION RECOGNITION FROM SPEECH SIGNAL: REALIZATION AND AVAILABLE TECHNIQUES

    Directory of Open Access Journals (Sweden)

    NILIM JYOTI GOGOI

    2014-05-01

    Full Text Available The ability to detect human emotion from speech is going to be a great addition in the field of human-robot interaction. The aim of the work is to build an emotion recognition system using Mel-frequency cepstral coefficients (MFCC) and a Gaussian mixture model (GMM) classifier. The purpose of the work is essentially to describe the best available methods for recognizing emotion from emotional speech. For that reason, already existing techniques and methods used for feature extraction and pattern classification have been reviewed and discussed in this paper.
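
    A minimal sketch of the MFCC + GMM recipe with librosa and scikit-learn: one GMM is fitted per emotion on pooled MFCC frames, and a test utterance goes to the emotion whose model yields the highest average log-likelihood (mixture count and MFCC order are illustrative):

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def train_emotion_gmms(wavs_by_emotion, sr=16000, n_mfcc=13, n_mix=32):
    """wavs_by_emotion maps an emotion label to a list of waveforms."""
    models = {}
    for emotion, wavs in wavs_by_emotion.items():
        frames = np.vstack([librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
                            for y in wavs])
        models[emotion] = GaussianMixture(n_mix, covariance_type='diag').fit(frames)
    return models

def classify_emotion(y, models, sr=16000, n_mfcc=13):
    feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T
    return max(models, key=lambda e: models[e].score(feats))  # avg log-likelihood
```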

  9. Statistical pattern recognition for automatic writer identification and verification

    NARCIS (Netherlands)

    Bulacu, Marius Lucian

    2007-01-01

    The thesis addresses the problem of automatic person identification using scanned images of handwriting.Identifying the author of a handwritten sample using automatic image-based methods is an interesting pattern recognition problem with direct applicability in the forensic and historic document ana

  10. Computer-based automatic finger- and speech-tracking system.

    Science.gov (United States)

    Breidegard, Björn

    2007-11-01

    This article presents the first technology ever for online registration and interactive and automatic analysis of finger movements during tactile reading (Braille and tactile pictures). Interactive software has been developed for registration (with two cameras and a microphone), MPEG-2 video compression and storage on disk or DVD as well as an interactive analysis program to aid human analysis. An automatic finger-tracking system has been implemented which also semiautomatically tracks the reading aloud speech on the syllable level. This set of tools opens the way for large scale studies of blind people reading Braille or tactile images. It has been tested in a pilot project involving congenitally blind subjects reading texts and pictures. PMID:18183897

  11. Automatic local Gabor Features extraction for face recognition

    CERN Document Server

    Jemaa, Yousra Ben

    2009-01-01

    We present in this paper a biometric system for face detection and recognition in color images. The face detection technique is based on skin color information and fuzzy classification. A new algorithm is proposed in order to automatically detect face features (eyes, mouth and nose) and extract their corresponding geometrical points. These fiducial points are described by sets of wavelet components which are used for recognition. To achieve face recognition, we use neural networks and we study their performance for different inputs. We compare the two types of features used for recognition: geometric distances and Gabor coefficients, which can be used either independently or jointly. This comparison shows that Gabor coefficients are more powerful than geometric distances. We show with experimental results how the high recognition ratio makes our system an effective tool for automatic face detection and recognition.

  12. Using vector Taylor series with noise clustering for speech recognition in non-stationary noisy environments

    Institute of Scientific and Technical Information of China (English)

    2006-01-01

    The performance of an automatic speech recognizer degrades seriously when there are mismatches between the training and testing conditions. The Vector Taylor Series (VTS) approach has been used to compensate for mismatches caused by additive noise and convolutive channel distortion in the cepstral domain. In this paper, the conventional VTS is extended by incorporating noise clustering into its EM iteration procedure, improving its compensation effectiveness in non-stationary noisy environments. Recognition experiments in babble and exhibition noise environments demonstrate that the new algorithm achieves a 35% average error rate reduction compared with the conventional VTS.
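
    For reference, a sketch of the standard cepstral-domain mismatch function that VTS methods linearize (the usual formulation from the literature; the notation is ours, not this paper's):

```latex
% Noisy cepstra y from clean speech x, channel h and additive noise n,
% with C the DCT matrix used to compute cepstra:
y = x + h + C \log\!\bigl(1 + e^{C^{-1}(n - x - h)}\bigr)
% First-order VTS expansion around the points (\mu_x, \mu_h, \mu_n):
y \approx y_0 + G(x - \mu_x) + G(h - \mu_h) + (I - G)(n - \mu_n),
% where G = \partial y / \partial x at the expansion point; the extension
% above re-estimates the noise terms per noise cluster inside the EM loop.
```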

  13. Speech signal recognition with the homotopic representation method

    Science.gov (United States)

    Bianchi, F.; Pocci, P.; Prina-Ricotti, L.

    1981-02-01

    Speech recognition by a computer using homotopic representation is introduced, including the algorithm and the processing mode for the speech signal, the results of a vowel recognition experiment, and the results of a phonetic transcription experiment with simple words composed of four phonemes. The signal is stored in a delay line of M elements and N = M + 1 outputs. 'Homotopic' refers to a pair of outputs symmetrical about the output located at the central element of the delay line. When the products of homotopic output samples of the first pair-sampling sequence are found, they are separately summed with the products of the following processing. This procedure is repeated continuously so that at every instant the transform function is the result of the last processing and the weighted sum of the previous results. In tests a female /o/ is recognized as /a/. Of 320 test phonemes, 15 are mistaken and 7 are dubious.

  14. Drawing Recognition for Automatic Dimensioning of Shear-Walls

    Institute of Scientific and Technical Information of China (English)

    任爱珠; 喻强; 许云

    2002-01-01

    In computer-aided structural design, the drawing of shear-walls cannot be easily automated; however, dimensioning of the shear-walls provides a method to automate the drawing. This paper presents a drawing recognition method for automatic dimensioning of shear-walls. The regional relationship method includes a graphic shape template library that can learn new shear-wall shapes. The automatic dimensioning of shear-walls is then realized by matching the templates. The regional relationship method for graph recognition effectively describes the topological relationships for graphs to significantly increase the recognition efficiency.

  15. Part-of-Speech Enhanced Context Recognition

    DEFF Research Database (Denmark)

    Madsen, Rasmus Elsborg; Larsen, Jan; Hansen, Lars Kai

    2004-01-01

    Language independent 'bag-of-words' representations are surprisingly effective for text classification. In this communication our aim is to elucidate the synergy between language independent features and simple language model features. We consider term tag features estimated by a so-called part... probabilistic neural network classifier. Three medium size data-sets are analyzed and we find consistent synergy between the term and natural language features in all three sets for a range of training set sizes. The most significant enhancement is found for small text databases where high recognition...

  16. Initial evaluation of a continuous speech recognition program for radiology

    OpenAIRE

    Kanal, KM; Hangiandreou, NJ; Sykes, AM; Eklund, HE; Araoz, PA; Leon, JA; Erickson, BJ

    2001-01-01

    The aims of this work were to measure the accuracy of one continuous speech recognition product and dependence on the speaker's gender and status as a native or nonnative English speaker, and evaluate the product's potential for routine use in transcribing radiology reports. IBM MedSpeak/Radiology software, version 1.1 was evaluated by 6 speakers. Two were nonnative English speakers, and 3 were men. Each speaker dictated a set of 12 reports. The reports included neurologic and body imaging ex...

  17. On the Generalization of Shannon Entropy for Speech Recognition

    OpenAIRE

    Obin, Nicolas; Liuni, Marco

    2012-01-01

    This paper introduces an entropy-based spectral representation as a measure of the degree of noisiness in audio signals, complementary to the standard MFCCs for audio and speech recognition. The proposed representation is based on the Rényi entropy, which is a generalization of the Shannon entropy. In audio signal representation, Rényi entropy presents the advantage of focusing either on the harmonic content (prominent amplitude within a distribution) or on the noise content (equal distributi...
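
    For concreteness, the Rényi entropy of a magnitude spectrum can be computed as below; the order alpha and the flooring constant are illustrative choices, not the authors' settings.

        import numpy as np

        def renyi_entropy(spectrum, alpha=2.0, eps=1e-12):
            """Rényi entropy of a magnitude spectrum treated as a probability
            distribution; alpha -> 1 recovers the Shannon entropy, large alpha
            emphasizes prominent (harmonic) peaks, small alpha the noise floor."""
            p = spectrum / (spectrum.sum() + eps)
            if np.isclose(alpha, 1.0):
                return -np.sum(p * np.log(p + eps))       # Shannon limit
            return np.log(np.sum(p ** alpha) + eps) / (1.0 - alpha)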

  18. Syntactic error modeling and scoring normalization in speech recognition: Error modeling and scoring normalization in the speech recognition task for adult literacy training

    Science.gov (United States)

    Olorenshaw, Lex; Trawick, David

    1991-01-01

    The purpose was to develop a speech recognition system able to detect speech that is pronounced incorrectly, given that the text of the spoken speech is known to the recognizer. Better mechanisms are provided for using speech recognition in a literacy tutor application. Using a combination of scoring normalization techniques and cheater-mode decoding, a reasonable acceptance/rejection threshold was provided. In continuous speech tests, the system provided above 80% correct acceptance of words, while correctly rejecting over 80% of incorrectly pronounced words.

  19. Statistical modeling of speech Poincaré sections in combination of frequency analysis to improve speech recognition performance.

    Science.gov (United States)

    Jafari, Ayyoob; Almasganj, Farshad; Bidhendi, Maryam Nabi

    2010-09-01

    This paper introduces a combinational feature extraction approach to improve speech recognition systems. The main idea is to simultaneously benefit from features obtained from a Poincaré section applied to the speech reconstructed phase space (RPS) and from typical Mel frequency cepstral coefficients (MFCCs), which have a proven role in the speech recognition field. With an appropriate dimension, the reconstructed phase space of the speech signal is assured to be topologically equivalent to the dynamics of the speech production system, and could therefore include information that may be absent in linear analysis approaches. Moreover, complicated systems such as the speech production system can present cyclic and oscillatory patterns, and Poincaré sections can be used as an effective tool for analyzing such trajectories. In this research, a statistical modeling approach based on Gaussian mixture models (GMMs) is applied to Poincaré sections of the speech RPS. A final pruned feature set is obtained by applying an efficient feature selection approach to the combination of the GMM parameters and MFCC-based features. A hidden Markov model-based speech recognition system and the TIMIT speech database are used to evaluate the performance of the proposed feature set by conducting isolated and continuous speech recognition experiments. With the proposed feature set, a 5.7% absolute improvement in isolated phoneme recognition is obtained over MFCC-based features alone. PMID:20887046
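
    A hedged sketch of the pipeline described above follows: delay embedding of the signal, a Poincaré section of the trajectory, and a GMM fit whose parameters become features. Embedding dimension, delay, section plane, and mixture size are placeholders we chose, and the random signal merely stands in for a speech segment.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def delay_embed(x, dim=3, tau=7):
            """Reconstruct the phase space of a scalar signal by delay embedding."""
            n = len(x) - (dim - 1) * tau
            return np.stack([x[i * tau:i * tau + n] for i in range(dim)], axis=1)

        def poincare_section(rps, axis=0, level=0.0):
            """Positive-going crossings of the plane rps[:, axis] == level,
            linearly interpolated; returns coordinates within the section."""
            s = rps[:, axis] - level
            idx = np.where((s[:-1] < 0) & (s[1:] >= 0))[0]
            w = -s[idx] / (s[idx + 1] - s[idx])
            pts = rps[idx] + w[:, None] * (rps[idx + 1] - rps[idx])
            return np.delete(pts, axis, axis=1)

        x = np.random.randn(16000)            # stand-in for a speech segment
        pts = poincare_section(delay_embed(x))
        gmm = GaussianMixture(n_components=4, covariance_type='diag').fit(pts)
        features = np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])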

  20. An automatic system for Turkish word recognition using Discrete Wavelet Neural Network based on adaptive entropy

    International Nuclear Information System (INIS)

    In this paper, an automatic system is presented for word recognition using real Turkish word signals. The paper especially deals with the combination of feature extraction and classification for real Turkish word signals. A Discrete Wavelet Neural Network (DWNN) model is used, which consists of two layers: a discrete wavelet layer and a multi-layer perceptron. The discrete wavelet layer is used for adaptive feature extraction in the time-frequency domain and is composed of the Discrete Wavelet Transform (DWT) and wavelet entropy. The multi-layer perceptron used for classification is a feed-forward neural network. The performance of the system is evaluated using noisy Turkish word signals. Test results showing the effectiveness of the proposed automatic system are presented in this paper. The rate of correct recognition is about 92.5% for the sample speech signals. (author)
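
    One plausible rendering of the "DWT plus wavelet entropy" front end, using the PyWavelets package, is sketched below; the wavelet family and decomposition depth are illustrative choices, not values from the paper.

        import numpy as np
        import pywt

        def wavelet_entropy_features(frame, wavelet='db4', level=5):
            """DWT followed by a per-subband (normalized Shannon) entropy,
            one plausible reading of the DWNN front end; wavelet family and
            depth are illustrative choices."""
            coeffs = pywt.wavedec(frame, wavelet, level=level)
            feats = []
            for c in coeffs:
                e = c ** 2
                p = e / (e.sum() + 1e-12)        # energy distribution in subband
                feats.append(-np.sum(p * np.log(p + 1e-12)))
            return np.array(feats)               # one entropy per subband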

  1. Emerging technologies with potential for objectively evaluating speech recognition skills.

    Science.gov (United States)

    Rawool, Vishakha Waman

    2016-02-01

    Work-related exposure to noise and other ototoxins can cause damage to the cochlea, synapses between the inner hair cells, the auditory nerve fibers, and higher auditory pathways, leading to difficulties in recognizing speech. Procedures designed to determine speech recognition scores (SRS) in an objective manner can be helpful in disability compensation cases where the worker claims to have poor speech perception due to exposure to noise or ototoxins. Such measures can also be helpful in determining SRS in individuals who cannot provide reliable responses to speech stimuli, including patients with Alzheimer's disease, traumatic brain injuries, and infants with and without hearing loss. Cost-effective neural monitoring hardware and software is being rapidly refined due to the high demand for neurogaming (games involving the use of brain-computer interfaces), health, and other applications. More specifically, two related advances in neuro-technology include relative ease in recording neural activity and availability of sophisticated analysing techniques. These techniques are reviewed in the current article and their applications for developing objective SRS procedures are proposed. Issues related to neuroaudioethics (ethics related to collection of neural data evoked by auditory stimuli including speech) and neurosecurity (preservation of a person's neural mechanisms and free will) are also discussed. PMID:26807789

  2. Age-Related Differences in Lexical Access Relate to Speech Recognition in Noise.

    Science.gov (United States)

    Carroll, Rebecca; Warzybok, Anna; Kollmeier, Birger; Ruigendijk, Esther

    2016-01-01

    Vocabulary size has been suggested as a useful measure of "verbal abilities" that correlates with speech recognition scores. Knowing more words is linked to better speech recognition. How vocabulary knowledge translates to general speech recognition mechanisms, how these mechanisms relate to offline speech recognition scores, and how they may be modulated by acoustical distortion or age, is less clear. Age-related differences in linguistic measures may predict age-related differences in speech recognition in noise performance. We hypothesized that speech recognition performance can be predicted by the efficiency of lexical access, which refers to the speed with which a given word can be searched and accessed relative to the size of the mental lexicon. We tested speech recognition in a clinical German sentence-in-noise test at two signal-to-noise ratios (SNRs), in 22 younger (18-35 years) and 22 older (60-78 years) listeners with normal hearing. We also assessed receptive vocabulary, lexical access time, verbal working memory, and hearing thresholds as measures of individual differences. Age group, SNR level, vocabulary size, and lexical access time were significant predictors of individual speech recognition scores, but working memory and hearing threshold were not. Interestingly, longer accessing times were correlated with better speech recognition scores. Hierarchical regression models for each subset of age group and SNR showed very similar patterns: the combination of vocabulary size and lexical access time contributed most to speech recognition performance; only for the younger group at the better SNR (yielding about 85% correct speech recognition) did vocabulary size alone predict performance. Our data suggest that successful speech recognition in noise is mainly modulated by the efficiency of lexical access. This suggests that older adults' poorer performance in the speech recognition task may have arisen from reduced efficiency in lexical access; with an

  4. Automatic Facial Expression Recognition Based on Hybrid Approach

    Directory of Open Access Journals (Sweden)

    Ali K. K. Bermani

    2012-12-01

    Full Text Available The topic of automatic recognition of facial expressions attracted many researchers late last century and has drawn increasing interest in the past few years. Several techniques have emerged to improve recognition efficiency by addressing problems in face detection and in extracting features for recognizing expressions. This paper proposes an automatic system for facial expression recognition whose feature extraction phase uses a hybrid approach, combining holistic and analytic methods to extract 307 facial expression features (19 geometric features, 288 appearance features). Expression recognition is performed using a radial basis function (RBF) artificial neural network to recognize the six basic emotions (anger, fear, disgust, happiness, surprise, sadness) in addition to the neutral state. The system achieved a recognition rate of 97.08% on a person-dependent database and 93.98% on a person-independent one.

  5. Event-Synchronous Analysis for Connected-Speech Recognition.

    Science.gov (United States)

    Morgan, David Peter

    The motivation for event-synchronous speech analysis originates from linear system theory where the speech-source transfer function is excited by an impulse-like driving function. In speech processing, the impulse response obtained from this linear system contains both semantic information and the vocal tract transfer function. Typically, an estimate of the transfer function is obtained via the spectrum by assuming a short-time stationary signal within some analysis window. However, this spectrum is often distorted by the periodic effects which occur when multiple (pitch) impulses are included in the analysis window. One method to remove these effects would be to deconvolve the excitation function from the speech signal to obtain the transfer function. The more attractive approach is to locate and identify the excitation function and synchronize the analysis frame with it. Event-synchronous analysis differs from pitch-synchronous analysis in that there are many events useful for speech recognition which are not pitch excited. In addition, event-synchronous analysis locates the important boundaries between speech events, such as voiced-to-unvoiced and silence-to-burst transitions. In asynchronous processing, an analysis frame which contains portions of two adjacent but dissimilar speech events is often so ambiguous as to distort or mask the important "phonetic" features of both events. Thus event-synchronous processing is employed to obtain an accurate spectral estimate and in turn enhance the estimate of the vocal-tract transfer function. Among the issues which have been addressed in implementing an event-synchronous recognition system are those of developing robust event (pitch, burst, etc.) detectors, synchronous-analysis methodologies, more meaningful feature sets, and dynamic programming algorithms for nonlinear time alignment. An advantage of event-synchronous processing is that the improved representation of the transfer function creates an opportunity for

  6. Error analysis to improve the speech recognition accuracy on Telugu language

    Indian Academy of Sciences (India)

    N Usha Rani; P N Girija

    2012-12-01

    Speech is one of the most important communication channels among people, and speech recognition occupies a prominent place in communication between humans and machines. Several factors affect the accuracy of a speech recognition system, and despite much effort to increase accuracy, current systems still generate erroneous output. Telugu is one of the most widely spoken south Indian languages. In the proposed Telugu speech recognition system, errors obtained from the decoder are analysed to improve the performance of the system. The static pronunciation dictionary plays a key role in recognition accuracy, so modifications are performed on the dictionary used in the decoder. These modifications reduce the number of confusion pairs, which improves the performance of the speech recognition system. Language model scores also vary with this modification; the hit rate increases considerably, and the false alarms change as the pronunciation dictionary is modified. Variations are observed in different error measures such as F-measure, error rate and Word Error Rate (WER) on application of the proposed method.

  7. Improving on hidden Markov models: An articulatorily constrained, maximum likelihood approach to speech recognition and speech coding

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1996-11-05

    The goal of the proposed research is to test a statistical model of speech recognition that incorporates the knowledge that speech is produced by relatively slow motions of the tongue, lips, and other speech articulators. This model is called Maximum Likelihood Continuity Mapping (Malcom). Many speech researchers believe that by using constraints imposed by articulator motions, we can improve or replace the current hidden Markov model based speech recognition algorithms. Unfortunately, previous efforts to incorporate information about articulation into speech recognition algorithms have suffered because (1) slight inaccuracies in our knowledge or the formulation of our knowledge about articulation may decrease recognition performance, (2) small changes in the assumptions underlying models of speech production can lead to large changes in the speech derived from the models, and (3) collecting measurements of human articulator positions in sufficient quantity for training a speech recognition algorithm is still impractical. The most interesting (and in fact, unique) quality of Malcom is that, even though Malcom makes use of a mapping between acoustics and articulation, Malcom can be trained to recognize speech using only acoustic data. By learning the mapping between acoustics and articulation using only acoustic data, Malcom avoids the difficulties involved in collecting articulator position measurements and does not require an articulatory synthesizer model to estimate the mapping between vocal tract shapes and speech acoustics. Preliminary experiments that demonstrate that Malcom can learn the mapping between acoustics and articulation are discussed. Potential applications of Malcom aside from speech recognition are also discussed. Finally, specific deliverables resulting from the proposed research are described.

  8. Robust Speech Recognition Using Factorial HMMs for Home Environments

    Directory of Open Access Journals (Sweden)

    Sadaoki Furui

    2007-01-01

    Full Text Available We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of a factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM. While in conventional studies this architecture is defined only for static features of the observation vector, we extended it to dynamic features. In addition, we performed home-environment adaptation of FHMMs to the characteristics of a given house. A database recorded by a personal robot called PaPeRo in home environments was used for the evaluation of the proposed method. Isolated word recognition experiments demonstrated the effectiveness of the proposed method under noisy conditions. Home-dependent word FHMMs (HD-FHMMs) reduced the word error rate by 20.5% compared to that of the clean-speech word HMMs.
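
    To make the factorial construction concrete, the sketch below forms the product of a clean-speech HMM and a noise HMM; the factorized transition matrix and the log-max emission combination are standard FHMM simplifications, not necessarily the exact formulation used in the paper.

        import numpy as np

        def factorial_combine(A_speech, A_noise, mu_speech, mu_noise):
            """Product HMM of a clean-speech HMM and a sudden-noise HMM.
            States are pairs (i, j); transitions factorize, and each pair
            emits the element-wise maximum of the two log-spectral means
            (the usual log-max approximation)."""
            A = np.kron(A_speech, A_noise)        # P((i, j) -> (k, l))
            mu = np.array([np.maximum(mi, mj)
                           for mi in mu_speech for mj in mu_noise])
            return A, mu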

  10. RFID: A Revolution in Automatic Data Recognition

    Science.gov (United States)

    Deal, Walter F., III

    2004-01-01

    Radio frequency identification, or RFID, is a generic term for technologies that use radio waves to automatically identify people or objects. There are several methods of identification, but the most common is to store a serial number that identifies a person or object, and perhaps other information, on a microchip that is attached to an antenna…

  11. Sign language perception research for improving automatic sign language recognition

    OpenAIRE

    Ten Holt, G.A.; Arendsen, J.; De Ridder, H.; Van Doorn, A.J.; Reinders, M.J.T.; Hendriks, E.A.

    2009-01-01

    Current automatic sign language recognition (ASLR) seldom uses perceptual knowledge about the recognition of sign language. Using such knowledge can improve ASLR because it can give an indication which elements or phases of a sign are important for its meaning. Also, the current generation of data-driven ASLR methods has shortcomings which may not be solvable without the use of knowledge on human sign language processing. Handling variation in the precise execution of signs is an example of s...

  12. Experiments in Image Segmentation for Automatic US License Plate Recognition

    OpenAIRE

    Diaz Acosta, Beatriz

    2004-01-01

    License plate recognition/identification (LPR/I) applies image processing and character recognition technology to identify vehicles by automatically reading their license plates. In the United States, however, each state has its own standard-issue plates, plus several optional styles, which are referred to as special license plates or varieties. There is a clear absence of standardization and multi-colored, complex backgrounds are becoming more frequent in license plates. Commercially availab...

  13. Speech recognition in individuals having a clinical complaint about understanding speech during noise or not

    Directory of Open Access Journals (Sweden)

    Becker, Karine Thaís

    2011-07-01

    Full Text Available Introduction: Clinical and experimental study. Individuals with normal hearing can be at a disadvantage in adverse communication situations, which negatively interferes with speech intelligibility. Objective: to check and compare the performance of normal-hearing young adults who do or do not report difficulty understanding speech in noise, using sentences as stimuli. Method: 50 normal-hearing individuals, 21 male and 29 female, aged between 19 and 32, were evaluated and divided into two groups: with and without a clinical complaint of understanding speech in noise. Using the Portuguese Sentence Lists test, the Recognition Threshold of Sentences in Noise was measured, from which the signal-to-noise (SN) ratios were obtained. The competing noise was presented at 65 dB HL. Results: the average SN ratios in the right ear, for the group without a complaint and the group with a complaint, were respectively -6.26 dB and -3.62 dB; for the left ear, the values were -7.12 dB and -4.12 dB. A statistically significant difference was observed in both the right and left ears between the two groups. Conclusion: normal-hearing individuals with a clinical complaint of understanding speech in noisy places have more difficulty in the task of recognizing sentences in noise than those who do not report such a difficulty. Accordingly, the customary audiologic evaluation should include tests using sentences in competing noise, in order to evaluate speech recognition performance more reliably and efficiently. ACTRN12610000822088

  14. Recognition of Emotions in German Speech Using Gaussian Mixture Models

    Czech Academy of Sciences Publication Activity Database

    Vondra, Martin; Vích, Robert

    Vol. 5398. Berlin: SPRINGER-VERLAG, 2009 - (Esposito, A.; Hussain, A.; Marinaro, M.; Martone, R.), s. 256-263. (Lecture Notes in Artificial Intelligence . 5398). ISBN 978-3-642-00524-4. ISSN 0302-9743. [euCognition International Training School on Multimodal Signals - Cognitive and Algorithmic Issues (European COST A2102). Vietri sul Mare (IT), 21.04.2008-26.04.2008] R&D Projects: GA MŠk OC08010 Institutional research plan: CEZ:AV0Z20670512 Keywords : emotion recognition * speech emotions Subject RIV: JA - Electronics ; Optoelectronics, Electrical Engineering

  15. Class of data-flow architectures for speech recognition

    Energy Technology Data Exchange (ETDEWEB)

    Bisiani, R.

    1983-01-01

    The behaviour of many speech recognition systems, and of some artificial intelligence programs, can be modeled as a data-driven computation in which the instructions are complex but inexpensive operations (e.g. the evaluation of the likelihood of a partial sentence hypothesis), and the data-flow graph is derived directly from the knowledge representation (e.g. the phoneme-level network in a Harpy system). The architecture of a machine that exploits this characteristic is presented, together with the results of the simulation of one possible implementation. 15 references.

  16. Real-world speech recognition with neural networks

    Science.gov (United States)

    Barnard, Etienne; Cole, Ronald; Fanty, Mark; Vermeulen, Pieter J. E.

    1995-04-01

    We describe a system based on neural networks that is designed to recognize speech transmitted through the telephone network. Context-dependent phonetic modeling is studied as a method of improving recognition accuracy, and a special training algorithm is introduced to make the training of these nets more manageable. Our system is designed for real-world applications, and we have therefore specialized our implementation for this goal; a pipelined DSP structure and a compact search algorithm are described as examples of this specialization. Preliminary results from a realistic test of the system (a field trial for the U.S. Census Bureau) are reported.

  17. Automatic Recognition of Element Classes and Boundaries in the Birdsong with Variable Sequences.

    Science.gov (United States)

    Koumura, Takuya; Okanoya, Kazuo

    2016-01-01

    Research on sequential vocalization often requires analysis of vocalizations in long continuous sounds. In studies such as developmental ones, or studies across generations in which days or months of vocalizations must be analyzed, methods for automatic recognition are strongly desired. Although methods for automatic speech recognition for application purposes have been intensively studied, blindly applying them for biological purposes may not be an optimal solution. This is because, unlike human speech recognition, analysis of sequential vocalizations often requires accurate extraction of timing information. In the present study we propose automated systems suitable for recognizing birdsong, one of the most intensively investigated sequential vocalizations, focusing on three properties of the birdsong. First, a song is a sequence of vocal elements, called notes, which can be grouped into categories. Second, the temporal structure of birdsong is precisely controlled, meaning that temporal information is important in song analysis. Finally, notes are produced according to certain probabilistic rules, which may facilitate accurate song recognition. We divided the procedure of song recognition into three sub-steps: local classification, boundary detection, and global sequencing, each of which corresponds to one of the three properties of birdsong. We compared the performances of several different ways to arrange these three steps. As a result, we demonstrated that a hybrid model of a deep convolutional neural network and a hidden Markov model was effective. We propose suitable arrangements of methods according to whether accurate boundary detection is needed. We also designed a new measure to jointly evaluate the accuracy of note classification and boundary detection. Our methods should be applicable, with small modification and tuning, to the songs of other species that hold the three properties of sequential vocalization. PMID:27442240

  18. Radar automatic target recognition (ATR) and non-cooperative target recognition (NCTR)

    CERN Document Server

    Blacknell, David

    2013-01-01

    The ability to detect and locate targets by day or night, over wide areas, regardless of weather conditions has long made radar a key sensor in many military and civil applications. However, the ability to automatically and reliably distinguish different targets represents a difficult challenge. Radar Automatic Target Recognition (ATR) and Non-Cooperative Target Recognition (NCTR) captures material presented in the NATO SET-172 lecture series to provide an overview of the state-of-the-art and continuing challenges of radar target recognition. Topics covered include the problem as applied to th

  19. Quality Assessment of Compressed Video for Automatic License Plate Recognition

    DEFF Research Database (Denmark)

    Ukhanova, Ann; Støttrup-Andersen, Jesper; Forchhammer, Søren; Madsen, John

    2014-01-01

    Definition of video quality requirements for video surveillance poses new questions in the area of quality assessment. This paper presents a quality assessment experiment for an automatic license plate recognition scenario. We explore the influence of the compression by H.264/AVC and H.265/HEVC...

  20. Automatization and Orthographic Development in Second Language Visual Word Recognition

    Science.gov (United States)

    Kida, Shusaku

    2016-01-01

    The present study investigated second language (L2) learners' acquisition of automatic word recognition and the development of L2 orthographic representation in the mental lexicon. Participants in the study were Japanese university students enrolled in a compulsory course involving a weekly 30-minute sustained silent reading (SSR) activity with…

  1. Two Systems for Automatic Music Genre Recognition

    DEFF Research Database (Denmark)

    Sturm, Bob L.

    2012-01-01

    We re-implement and test two state-of-the-art systems for automatic music genre classification; but unlike past works in this area, we look closer than ever before at their behavior. First, we look at specific instances where each system consistently applies the same wrong label across multiple trials of cross-validation. Second, we test the robustness of each system to spectral equalization. Finally, we test how well human subjects recognize the genres of music excerpts composed by each system to be highly genre representative. Our results suggest that neither high-performing system has a...

  2. Automatic Understanding of Spontaneous Arabic Speech --- A Numerical Model Compréhension automatique de la parole arabe spontanée --- Une modélisation numérique

    Directory of Open Access Journals (Sweden)

    Anis Zouaghi

    2009-01-01

    Full Text Available This work is part of a large research project entitled "Oreillodule" aimed at developing tools for automatic speech recognition, translation, and synthesis for the Arabic language. Our attention has mainly been focused on the semantic analyzer developed for the automatic comprehension of standard spontaneous Arabic speech. The findings on the effectiveness of the semantic decoder are quite satisfactory.

  3. Automatic Modulation Recognition by Support Vector Machines Using Wavelet Kernel

    International Nuclear Information System (INIS)

    Automatic modulation identification plays a significant role in electronic warfare, electronic surveillance systems and electronic countermeasures. The task of modulation recognition of communication signals is to determine the modulation type and signal parameters. In fact, automatic modulation identification can be regarded as an application of pattern recognition in the communication field. The support vector machine (SVM) is a universal learning machine widely used in the fields of pattern recognition, regression estimation and probability density estimation. In this paper, a new method using a wavelet kernel function is proposed, which maps the input vector xi into a high-dimensional feature space F. In this feature space F, we can construct the optimal hyperplane that realizes the maximal margin. That is to say, we can use the SVM to classify communication signals into two groups, namely analogue modulated signals and digitally modulated signals. In addition, computer simulation results are given at last, which show the good performance of the method
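
    A minimal scikit-learn sketch of an SVM with a wavelet kernel follows; the Morlet-like mother wavelet h(t) = cos(1.75 t) exp(-t^2/2) is the form usually quoted for wavelet kernels, and the dilation parameter a is our assumption.

        import numpy as np
        from sklearn.svm import SVC

        def wavelet_kernel(X, Y, a=1.0):
            """Wavelet kernel K(x, y) = prod_i h((x_i - y_i) / a) with the
            mother wavelet h(t) = cos(1.75 t) * exp(-t^2 / 2)."""
            t = (np.asarray(X)[:, None, :] - np.asarray(Y)[None, :, :]) / a
            return np.prod(np.cos(1.75 * t) * np.exp(-t ** 2 / 2.0), axis=2)

        # Two-class problem: analogue vs. digitally modulated signals,
        # described by feature vectors (training data not shown here).
        clf = SVC(kernel=wavelet_kernel)
        # clf.fit(X_train, y_train); y_pred = clf.predict(X_test)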

  4. Efficient Speech Recognition by Using Modular Neural Network

    Directory of Open Access Journals (Sweden)

    Dr.R.L.K.Venkateswarlu

    2011-05-01

    Full Text Available The modular approach and the neural network approach are well-known concepts in the research and engineering community. By combining the two, the Modular Neural Network approach is very effective in searching for solutions to complex problems in various fields. The aim of this study is the distribution of the complexity of the ambiguous-word classification task over a set of modules. Each of these modules is a single neural network characterized by a high degree of specialization. A modular architecture also increases the number of interfaces, and therewith the possibilities for filtering in external acoustic-phonetic knowledge. A Modular Neural Network (MNN) for speech recognition is presented in this paper, with speaker-dependent single-word recognition. Using this approach, and taking computational effort into account, the system performance can be assessed. The recognition performance is highest for MFCC features when training with Modular Neural Network classifiers, at 99.88%; for LPCC features the maximum is 99.77%. It is found that MFCC performance is superior to LPCC performance when training the speech data with a Modular Neural Network classifier.

  5. Modeling words with subword units in an articulatorily constrained speech recognition algorithm

    Energy Technology Data Exchange (ETDEWEB)

    Hogden, J.

    1997-11-20

    The goal of speech recognition is to find the most probable word given the acoustic evidence, i.e. a string of VQ codes or acoustic features. Speech recognition algorithms typically take advantage of the fact that the probability of a word, given a sequence of VQ codes, can be calculated.

  6. Design Of An Automatic Speaker Recognition System Using MFCC, Vector Quantization And LBG Algorithm

    Directory of Open Access Journals (Sweden)

    Prof. Ch.Srinivasa Kumar

    2011-08-01

    Full Text Available The results of a case study carried out while developing an automatic speaker recognition system are presented in this paper. The Vector Quantization (VQ) approach is used for mapping vectors from a large vector space to a finite number of regions in that space. Each region is called a cluster and can be represented by its center, called a codeword; the collection of all codewords is called a codebook. After the enrolment session, the acoustic vectors extracted from the input speech of a speaker provide a set of training vectors. The LBG algorithm, due to Linde, Buzo and Gray, is used for clustering a set of L training vectors into a set of M codebook vectors. For comparison purposes, the distance between each test codeword and each codeword in the master codebook is computed; the difference is used to make the recognition decision. The entire coding was done in MATLAB and the system was tested for its reliability.
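
    The abstract's MATLAB pipeline translates directly; below is a hedged Python rendering of LBG codebook training and the minimum-distortion decision, with the codebook size, split factor, and iteration count chosen for illustration.

        import numpy as np

        def lbg(train, M=16, eps=0.01, iters=20):
            """LBG codebook training: start from the global centroid and
            split until M codewords, refining with k-means-style passes."""
            codebook = train.mean(axis=0, keepdims=True)
            while len(codebook) < M:
                codebook = np.vstack([codebook * (1 + eps),
                                      codebook * (1 - eps)])
                for _ in range(iters):
                    d = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
                    nearest = d.argmin(axis=1)
                    for k in range(len(codebook)):
                        if np.any(nearest == k):
                            codebook[k] = train[nearest == k].mean(axis=0)
            return codebook

        def avg_distortion(test, codebook):
            """Mean distance of test vectors to their nearest codeword; the
            speaker whose codebook minimizes this is declared the match."""
            d = ((test[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            return d.min(axis=1).mean()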

  7. On model architecture for a children's speech recognition interactive dialog system

    OpenAIRE

    Kraleva, Radoslava; Kralev, Velin

    2016-01-01

    This report presents a general model of the architecture of information systems for the speech recognition of children. It presents a model of the speech data stream and how it works. The results of these studies and the presented architectural model show that research needs to be focused on acoustic-phonetic modeling in order to improve the quality of children's speech recognition and the robustness of the systems to noise and changes in the transmission environment. Another important aspe...

  8. Multimodal Approach for Automatic Emotion Recognition Applied to the Tension Levels Study in TV Newscasts

    Directory of Open Access Journals (Sweden)

    Moisés Henrique Ramos Pereira

    2015-12-01

    Full Text Available This article addresses a multimodal approach to automatic emotion recognition in participants of TV newscasts (presenters, reporters, commentators and others) able to assist the study of tension levels in narratives of events in this television genre. The methodology applies state-of-the-art computational methods to process and analyze facial expressions as well as speech signals. The proposed approach contributes to the semiodiscoursive study of TV newscasts and their enunciative praxis, assisting, for example, the identification of the communication strategy of these programs. To evaluate its effectiveness, the approach was applied to a video of a report shown on a highly popular Brazilian TV newscast in the state of Minas Gerais. The experimental results are promising for the recognition of emotions in the facial expressions of telejournalists and are in accordance with the distribution of audiovisual indicators extracted over the TV newscast, demonstrating the potential of the approach to support TV journalistic discourse analysis.

  9. AUTOMATIC RECOGNITION OF FREIGHT CAR NUMBER

    Institute of Scientific and Technical Information of China (English)

    2000-01-01

    This paper discusses methods for character extraction on the basis of statistical and structural features of gray-level images, and proposes a dynamic local contrast threshold method that accommodates line width. Precise locating of the character string is realized by exploiting horizontal projection and the character arrangements of binary images in the horizontal and vertical directions, respectively. Also discussed is a method for the segmentation of characters in binary images, based on projection that takes stroke width and character sizes into account. A new method for character identification is explored, based on compound neural networks. A compound neural network consists of two sub-nets: the first sub-net performs self-association of patterns via 2-dimensional locally connected third-order networks; the second sub-net, a locally connected BP network, performs classification. The reliability of the network recognition is reinforced by introducing conditions for identification denial. Experiments confirm that the proposed methods possess the advantages of impressive robustness, rapid processing and high accuracy of identification.

  10. Automatic recognition of printed Oriya script

    Indian Academy of Sciences (India)

    B B Chaudhuri; U Pal; M Mitra

    2002-02-01

    This paper deals with an Optical Character Recognition (OCR) system for printed Oriya script. The development of OCR for this script is difficult because a large number of character shapes in the script have to be recognized. In the proposed system, the document image is first captured using a flat-bed scanner and then passed through different preprocessing modules like skew correction, line segmentation, zone detection, word and character segmentation etc. These modules have been developed by combining some conventional techniques with some newly proposed ones. Next, individual characters are recognized using a combination of stroke and run-number based features, along with features obtained from the concept of water overflow from a reservoir. The feature detection methods are simple and robust, and do not require preprocessing steps like thinning and pruning. A prototype of the system has been tested on a variety of printed Oriya material, and currently achieves 96.3% character level accuracy on average.

  11. Variable Frame Rate and Length Analysis for Data Compression in Distributed Speech Recognition

    DEFF Research Database (Denmark)

    Kraljevski, Ivan; Tan, Zheng-Hua

    2014-01-01

    This paper addresses the issue of data compression in distributed speech recognition on the basis of a variable frame rate and length analysis method. The method first conducts frame selection by using a posteriori signal-to-noise ratio weighted energy distance to find the right time resolution...... length for steady regions. The method is applied to scalable source coding in distributed speech recognition where the target bitrate is met by adjusting the frame rate. Speech recognition results show that the proposed approach outperforms other compression methods in terms of recognition accuracy...

  12. Effective Prediction of Errors by Non-native Speakers Using Decision Tree for Speech Recognition-Based CALL System

    Science.gov (United States)

    Wang, Hongcui; Kawahara, Tatsuya

    CALL (Computer Assisted Language Learning) systems using ASR (Automatic Speech Recognition) for second language learning have received increasing interest recently. However, it still remains a challenge to achieve high speech recognition performance, including accurate detection of erroneous utterances by non-native speakers. Conventionally, possible error patterns, based on linguistic knowledge, are added to the lexicon and language model, or to the ASR grammar network. However, this approach easily runs into the trade-off between coverage of errors and increased perplexity. To solve the problem, we propose a method based on a decision tree to learn effective prediction of errors made by non-native speakers. An experimental evaluation with a number of foreign students learning Japanese shows that the proposed method can effectively generate an ASR grammar network, given a target sentence, achieving both better coverage of errors and smaller perplexity, resulting in significant improvement in ASR accuracy.
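
    As a toy illustration of the decision-tree idea, the scikit-learn sketch below predicts a likely mispronunciation from an encoded phoneme context; the feature encoding and error labels are invented for illustration and are not the paper's.

        from sklearn.tree import DecisionTreeClassifier

        # Toy training data: each row encodes a phoneme slot in the target
        # sentence (phone id, position in word, learner's L1 group); the
        # label is the substitution the learner is likely to produce.
        # All features and labels here are invented for illustration.
        X_train = [[3, 0, 1], [7, 2, 1], [3, 1, 0], [5, 0, 0]]
        y_train = ['r->l', 'OK', 'r->l', 'OK']

        tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
        print(tree.predict([[3, 2, 1]]))   # predicted error for a new context
        # Only predicted error arcs are added to the ASR grammar network,
        # covering likely errors without inflating perplexity.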

  13. Segment-based acoustic models for continuous speech recognition

    Science.gov (United States)

    Ostendorf, Mari; Rohlicek, J. R.

    1993-07-01

    This research aims to develop new and more accurate stochastic models for speaker-independent continuous speech recognition, by extending previous work in segment-based modeling and by introducing a new hierarchical approach to representing intra-utterance statistical dependencies. These techniques, which are more costly than traditional approaches because of the large search space associated with higher order models, are made feasible through rescoring a set of HMM-generated N-best sentence hypotheses. We expect these different modeling techniques to result in improved recognition performance over that achieved by current systems, which handle only frame-based observations and assume that these observations are independent given an underlying state sequence. In the fourth quarter of the project, we have completed the following: (1) ported our recognition system to the Wall Street Journal task, a standard task in the ARPA community; (2) developed an initial dependency-tree model of intra-utterance observation correlation; and (3) implemented baseline language model estimation software. Our initial results on the Wall Street Journal task are quite good and represent significantly improved performance over most HMM systems reporting on the Nov. 1992 5k vocabulary test set.

  14. Noise Adaptive Stream Weighting in Audio-Visual Speech Recognition

    Directory of Open Access Journals (Sweden)

    Martin Heckmann

    2002-11-01

    Full Text Available It has been shown that integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
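
    The SI fusion step itself is compact; in the sketch below, per-class log-likelihoods from the audio and video classifiers are mixed with an SNR-dependent weight, where the linear SNR-to-weight mapping is a stand-in for the mapping the authors derive from their reliability measurements.

        import numpy as np

        def fused_log_likelihood(ll_audio, ll_video, snr_db):
            """Separate Integration fusion: per-class log-likelihoods of the
            two single-modality classifiers are combined with an SNR-dependent
            weight (1.0 = trust audio fully). The linear SNR-to-weight mapping
            is an illustrative stand-in for the estimated mapping."""
            lam = np.clip((snr_db + 5.0) / 25.0, 0.0, 1.0)
            return lam * np.asarray(ll_audio) + (1.0 - lam) * np.asarray(ll_video)

        # The recognized word is the arg-max over classes of the fused scores.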

  15. Automatic Artist Recognition of Songs for Advanced Retrieval

    Institute of Scientific and Technical Information of China (English)

    ZHU Song-hao; LIU Yun-cai

    2008-01-01

    Automatic recognition of artists is very important in acoustic music indexing, browsing, and content-based acoustic music retrieval, but at the same time it remains a challenging task to extract the most representative and salient attributes for depicting diverse artists. In this paper, we developed a novel system to accomplish artist recognition automatically. The proposed system can efficiently identify the artist's voice in a raw song by analyzing substantive features extracted from both pure music and singing mixed with accompanying music. Experiments on different genres of songs illustrate that the proposed system is feasible.

  16. Automatic Facial Expression Recognition and Operator Functional State

    Science.gov (United States)

    Blanson, Nina

    2011-01-01

    The prevalence of human error in safety-critical occupations remains a major challenge to mission success despite increasing automation in control processes. Although various methods have been proposed to prevent incidences of human error, none of these have been developed to employ the detection and regulation of Operator Functional State (OFS), or the optimal condition of the operator while performing a task, in work environments due to drawbacks such as obtrusiveness and impracticality. A video-based system with the ability to infer an individual's emotional state from facial feature patterning mitigates some of the problems associated with other methods of detecting OFS, like obtrusiveness and impracticality in integration with the mission environment. This paper explores the utility of facial expression recognition as a technology for inferring OFS by first expounding on the intricacies of OFS and the scientific background behind emotion and its relationship with an individual's state. Then, descriptions of the feedback loop and the emotion protocols proposed for the facial recognition program are explained. A basic version of the facial expression recognition program uses Haar classifiers and OpenCV libraries to automatically locate key facial landmarks during a live video stream. Various methods of creating facial expression recognition software are reviewed to guide future extensions of the program. The paper concludes with an examination of the steps necessary in the research of emotion and recommendations for the creation of an automatic facial expression recognition program for use in real-time, safety-critical missions.

  18. High Range Resolution Profile Automatic Target Recognition Using Sparse Representation

    Institute of Scientific and Technical Information of China (English)

    Zhou Nuo; Chen Wei

    2010-01-01

    Sparse representation is a new signal analysis method which is receiving increasing attention in recent years. In this article, a novel scheme solving high range resolution profile automatic target recognition for ground moving targets is proposed. The sparse representation theory is applied to analyzing the components of high range resolution profiles, and sparse coefficients are used to describe their features. Numerous experiments with the target type number ranging from 2 to 6 have been implemented. Results show that the proposed scheme not only provides higher recognition preciseness in real time, but also achieves more robust performance as the target type number increases.

  19. Disordered Speech Assessment Using Automatic Methods Based on Quantitative Measures

    Directory of Open Access Journals (Sweden)

    Shrivastav Rahul

    2005-01-01

    Full Text Available Speech quality assessment methods are necessary for evaluating and documenting treatment outcomes of patients suffering from degraded speech due to Parkinson's disease, stroke, or other disease processes. Subjective methods of speech quality assessment are more accurate and more robust than objective methods but are time-consuming and costly. We propose a novel objective measure of speech quality assessment that builds on traditional speech processing techniques such as dynamic time warping (DTW) and the Itakura-Saito (IS) distortion measure. Initial results show that our objective measure correlates well with the more expensive subjective methods.
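
    For reference, the first of the two named building blocks, dynamic time warping, can be written in a few lines; the squared-Euclidean frame distance below is a placeholder that could be swapped for the Itakura-Saito distortion.

        import numpy as np

        def dtw(a, b, dist=lambda x, y: float(np.sum((x - y) ** 2))):
            """Dynamic time warping between two feature sequences; returns
            the accumulated distance of the best alignment path. The frame
            distance could be replaced by the Itakura-Saito distortion."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    c = dist(a[i - 1], b[j - 1])
                    D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]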

  20. SPEECH PROCESSING –AN OVERVIEW

    Directory of Open Access Journals (Sweden)

    A.INDUMATHI

    2012-06-01

    Full Text Available One of the earliest goals of speech processing was coding speech for efficient transmission. Later, research spread to various areas like Automatic Speech Recognition (ASR), Speech Synthesis (TTS), Speech Enhancement, and Automatic Language Translation (ALT). Initially, ASR was used to recognize single words in a small vocabulary; later, many products were developed for continuous speech with large vocabularies. Speech Synthesis is used for synthesizing the speech corresponding to a given text, and provides a way to communicate for persons unable to speak. When Speech Synthesis is used together with ASR, it allows a complete two-way spoken interaction between humans and machines. Speech Enhancement techniques are applied to improve the quality of the speech signal. Automatic Language Translation helps to convert one language into another. Basic concepts of speech processing are provided for beginners.

  1. Lexical decoder for continuous speech recognition: sequential neural network approach

    International Nuclear Information System (INIS)

    The work presented in this dissertation concerns the study of a connectionist architecture to treat sequential inputs. In this context, the model proposed by J.L. Elman, a recurrent multilayer network, is used. Its abilities and its limits are evaluated. Modifications are made in order to treat erroneous or noisy sequential inputs and to classify patterns. The application context of this study concerns the realisation of a lexical decoder for analytical multi-speaker continuous speech recognition. Lexical decoding is performed on lattices of phonemes obtained after an acoustic-phonetic decoding stage relying on a K Nearest Neighbors search technique. Tests are done on sentences formed from a lexicon of 20 words. The results obtained show the ability of the proposed connectionist model to take sequentiality into account at the input level, to memorize the context, and to treat noisy or erroneous inputs. (author)
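
    A minimal forward pass of the Elman network named above is sketched here in Python; the dimensions, tanh nonlinearity, and softmax output layer are conventional choices rather than details taken from the dissertation.

        import numpy as np

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        def elman_forward(inputs, W_in, W_rec, W_out):
            """Forward pass of an Elman network: the hidden layer at time t
            sees the current input and a copy of its own previous state (the
            context units), which lets the decoder accumulate left context
            along the phoneme lattice."""
            h = np.zeros(W_rec.shape[0])
            outputs = []
            for x in inputs:
                h = np.tanh(W_in @ x + W_rec @ h)     # context units feed back
                outputs.append(softmax(W_out @ h))    # e.g. word-class posteriors
            return outputs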

  2. Robust Digital Speech Watermarking For Online Speaker Recognition

    Directory of Open Access Journals (Sweden)

    Mohammad Ali Nematollahi

    2015-01-01

    Full Text Available A robust and blind digital speech watermarking technique has been proposed for online speaker recognition systems, based on the Discrete Wavelet Packet Transform (DWPT) and multiplication to embed the watermark in the amplitudes of the wavelet subbands. In order to minimize the degradation effect of the watermark, the subbands selected are those where less speaker-specific information is available (500 Hz–3500 Hz and 6000 Hz–7000 Hz). Experimental results on the Texas Instruments Massachusetts Institute of Technology (TIMIT), Massachusetts Institute of Technology (MIT), and Mobile Biometry (MOBIO) corpora show that the degradation for speaker verification and identification is 1.16% and 2.52%, respectively. Furthermore, the proposed watermark technique can provide enough robustness against different signal processing attacks.

  3. Design of Automatic Recognition of Cucumber Disease Image

    OpenAIRE

    Peng Guo; Tonghai Liu; Naixiang Li

    2014-01-01

    An automatic recognition method for cucumber disease images is presented. The threshold for image segmentation was generated with the 2-dimensional maximum entropy principle and optimized with a differential evolution algorithm. With the generated threshold values, we segmented the cucumber disease images and picked the lesion with the maximum area from the segmentation results as the representative lesion. We then analyzed the representative lesions of the disease images and extracted their color features and texture feature...

  4. Automatic facial expression recognition: a discrete choice approach

    OpenAIRE

    Bierlaire, Michel

    2009-01-01

    Automatic facial expression recognition finds applications in various fields where human-machine interactions are involved. We propose a framework based on discrete choice models, where we try to forecast how a human person would evaluate the facial expression, choosing the most appropriate label among a given list. After having applied the framework successfully on static images, we investigate the possibility to apply it on video sequences.

  5. Analyzing and Improving Statistical Language Models for Speech Recognition

    CERN Document Server

    Ueberla, J P

    1994-01-01

    In many current speech recognizers, a statistical language model is used to indicate how likely it is that a certain word will be spoken next, given the words recognized so far. How can statistical language models be improved so that more complex speech recognition tasks can be tackled? Since the knowledge of the weaknesses of any theory often makes improving the theory easier, the central idea of this thesis is to analyze the weaknesses of existing statistical language models in order to subsequently improve them. To that end, we formally define a weakness of a statistical language model in terms of the logarithm of the total probability, LTP, a term closely related to the standard perplexity measure used to evaluate statistical language models. We apply our definition of a weakness to a frequently used statistical language model, called a bi-pos model. This results, for example, in a new modeling of unknown words which improves the performance of the model by 14% to 21%. Moreover, one of the identified weak...
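
    To make the link between the log total probability (LTP) and perplexity concrete, here is a toy evaluation of an add-k smoothed bigram model; the smoothing scheme is a stand-in for whatever the thesis's bi-pos model uses.

        import math
        from collections import Counter

        def bigram_perplexity(train_tokens, test_tokens, vocab_size, k=1.0):
            """Perplexity of an add-k smoothed bigram model: the exponential
            of the negative mean log-probability, i.e. derived from the LTP
            the thesis analyzes (assumes at least two test tokens)."""
            unigrams = Counter(train_tokens)
            bigrams = Counter(zip(train_tokens, train_tokens[1:]))
            ltp = 0.0
            for w1, w2 in zip(test_tokens, test_tokens[1:]):
                p = (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)
                ltp += math.log(p)
            return math.exp(-ltp / (len(test_tokens) - 1))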

  6. The effect of automatic gain control structure and release time on cochlear implant speech intelligibility.

    Directory of Open Access Journals (Sweden)

    Phyu P Khing

    Full Text Available Nucleus cochlear implant systems incorporate a fast-acting front-end automatic gain control (AGC), sometimes called a compression limiter. The objective of the present study was to determine the effect of replacing the front-end compression limiter with a newly proposed envelope profile limiter. A secondary objective was to investigate the effect of AGC speed on cochlear implant speech intelligibility. The envelope profile limiter was located after the filter bank and reduced the gain when the largest of the filter bank envelopes exceeded the compression threshold. The compression threshold was set equal to the saturation level of the loudness growth function (i.e. the envelope level that mapped to the maximum comfortable current level), ensuring that no envelope clipping occurred. To preserve the spectral profile, the same gain was applied to all channels. Experiment 1 compared sentence recognition with the front-end limiter and with the envelope profile limiter, each with two release times (75 and 625 ms). Six implant recipients were tested in quiet and in four-talker babble noise, at a high presentation level of 89 dB SPL. Overall, release time had a larger effect than the AGC type. With both AGC types, speech intelligibility was lower for the 75 ms release time than for the 625 ms release time. With the shorter release time, the envelope profile limiter provided higher group mean scores than the front-end limiter in quiet, but there was no significant difference in noise. Experiment 2 measured sentence recognition in noise as a function of presentation level, from 55 to 89 dB SPL. The envelope profile limiter with 625 ms release time yielded better scores than the front-end limiter with 75 ms release time. A take-home study showed no clear pattern of preferences. It is concluded that the envelope profile limiter is a feasible alternative to a front-end compression limiter.
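
    The envelope profile limiter described above reduces to a few lines of code: when the largest filter-bank envelope exceeds the compression threshold, the same gain reduction is applied to every channel, and the gain recovers with a first-order release. This sketch only illustrates the principle; the clinical parameter values are not reproduced here.

        import numpy as np

        def envelope_profile_limiter(envelopes, threshold, release_ms, frame_ms=1.0):
            """envelopes: (n_frames, n_channels) filter-bank envelope magnitudes."""
            release = np.exp(-frame_ms / release_ms)   # e.g. 75 ms or 625 ms
            gain = 1.0
            out = np.empty_like(envelopes)
            for t in range(envelopes.shape[0]):
                peak = envelopes[t].max()
                target = min(1.0, threshold / peak) if peak > 0 else 1.0
                # Attack instantly, release slowly toward the target gain.
                gain = target if target < gain else release * gain + (1 - release) * target
                out[t] = gain * envelopes[t]   # same gain on all channels
            return out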

  7. Advanced automatic target recognition for police helicopter missions

    Science.gov (United States)

    Stahl, Christoph; Schoppmann, Paul

    2000-08-01

    The results of a case study on the application of an advanced method for automatic target recognition to infrared imagery taken from police helicopter missions are presented. The method consists of the following steps: preprocessing, classification, fusion, postprocessing and tracking, and combines three paradigms: image pyramids, neural networks and Bayesian nets. The technology has been developed using a variety of scenes typical of military aircraft missions. Infrared cameras have been in use for several years by the Bavarian police helicopter forces and are highly valuable for night missions. Several object classes such as 'persons' or 'vehicles' are tested, and the possible discrimination between persons and animals is shown. The analysis of complex scenes with hidden objects and clutter shows the potential and limitations of automatic target recognition for real-world tasks. Several display concepts illustrate the achievable improvement in situation awareness. The similarities and differences between various mission types concerning object variability, time constraints, consequences of false alarms, etc. are discussed. Typical police actions like searching for missing persons or runaway criminals illustrate the advantages of automatic target recognition. The results demonstrate the possible operational benefits for the helicopter crew. Future work will include performance evaluation issues and a system integration concept for the target platform.

  8. Automatic Recognition of Facial Actions in Spontaneous Expressions

    Directory of Open Access Journals (Sweden)

    Marian Stewart Bartlett

    2006-09-01

    Full Text Available Spontaneous facial expressions differ from posed expressions both in which muscles are moved and in the dynamics of the movement. Advances in the field of automatic facial expression measurement will require development and assessment on spontaneous behavior. Here we present preliminary results on a task of facial action detection in spontaneous facial expressions. We employ a user-independent, fully automatic system for real-time recognition of facial actions from the Facial Action Coding System (FACS). The system automatically detects frontal faces in the video stream and codes each frame with respect to 20 action units. The approach applies machine learning methods, such as support vector machines and AdaBoost, to texture-based image representations. The output margin of the learned classifiers predicts action unit intensity. Frame-by-frame intensity measurements will enable investigations into facial expression dynamics that were previously intractable by human coding.

  9. Benefits of spatial hearing to speech recognition in young people with normal hearing

    Institute of Scientific and Technical Information of China (English)

    SONG Peng-long; LI Hui-jun; WANG Ning-yu

    2011-01-01

    Background: Many factors interfere with a listener attempting to grasp speech in noisy environments. Spatial hearing, by which speech and noise can be spatially separated, may play a crucial role in speech recognition in the presence of competing noise. This study aimed to assess whether, and to what degree, spatial hearing benefits speech recognition in young normal-hearing participants in both quiet and noisy environments. Methods: Twenty-eight young participants were tested with the Mandarin Hearing In Noise Test (MHINT) in quiet and noisy environments. The assessment method was characterized by modifications of speech and noise configurations, as well as by changes of the speech presentation mode. The benefit of spatial hearing was measured by the speech recognition threshold (SRT) variation between speech condition 1 (SC1) and speech condition 2 (SC2). Results: There was no significant difference in the SRT between SC1 and SC2 in quiet. The SRT in SC1 was about 4.2 dB lower than that in SC2, in both the speech-shaped and four-babble noise conditions. SRTs measured in both SC1 and SC2 were lower in the speech-shaped noise condition than in the four-babble noise condition. Conclusion: Spatial hearing in young normal-hearing participants contributes to speech recognition in noisy environments, but provides no benefit to speech recognition in quiet environments, which may be due to the offset of auditory extrinsic redundancy against the lack of spatial hearing.

  10. A system of automatic speaker recognition on a minicomputer

    International Nuclear Information System (INIS)

    This study describes a system of automatic speaker recognition using the pitch of the voice. The pre-processing stage extracts the speakers' discriminating characteristics from the pitch. The recognition program first performs a preselection and then computes the distance between the characteristics of the speaker to be recognized and those of the speakers already enrolled. A recognition experiment was carried out with 15 speakers and comprised 566 tests spread intermittently over a period of four months. The discriminating characteristics used offer several interesting qualities. The algorithms for measuring the characteristics on the one hand, and for classifying the speakers on the other, are simple. The results obtained in real time on a minicomputer are satisfactory. Furthermore, they could probably be improved by considering other discriminating characteristics of the speakers, but this was unfortunately beyond our means. (author)
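
    A pitch-based discriminating characteristic of the kind used here can be obtained from the autocorrelation of a voiced frame; the search range and sampling rate below are illustrative, not taken from the original system.

        import numpy as np

        def estimate_f0(frame, sr, fmin=60.0, fmax=400.0):
            """Return the F0 (Hz) of a voiced frame via the autocorrelation peak."""
            frame = frame - frame.mean()
            ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
            lo, hi = int(sr / fmax), int(sr / fmin)
            lag = lo + int(np.argmax(ac[lo:hi]))
            return sr / lag

        sr = 16000
        t = np.arange(0, 0.03, 1.0 / sr)
        f0 = estimate_f0(np.sin(2 * np.pi * 120 * t), sr)  # close to 120 Hz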

  11. Integrating prosodic information into a speech recogniser

    OpenAIRE

    López Soto, María Teresa

    2001-01-01

    In the last decade there has been an increasing tendency to incorporate language engineering strategies into speech technology. This approach combines linguistic and mathematical information in different applications: machine translation, natural language processing, speech synthesis and automatic speech recognition (ASR). In the field of speech synthesis, this hybrid approach (linguistic and mathematical/statistical) has led to the design of efficient models for reproducin...

  12. A Support Vector Machine-Based Dynamic Network for Visual Speech Recognition Applications

    Directory of Open Access Journals (Sweden)

    Gordan Mihaela

    2002-01-01

    Full Text Available Visual speech recognition is an emerging research field. In this paper, we examine the suitability of support vector machines for visual speech recognition. Each word is modeled as a temporal sequence of visemes corresponding to the different phones realized. One support vector machine is trained to recognize each viseme and its output is converted to a posterior probability through a sigmoidal mapping. To model the temporal character of speech, the support vector machines are integrated as nodes into a Viterbi lattice. We test the performance of the proposed approach on a small visual speech recognition task, namely the recognition of the first four digits in English. The word recognition rate obtained is at the level of the previous best reported rates.
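
    The two ingredients named above, a sigmoidal mapping from SVM margins to posteriors and a Viterbi pass over the lattice, can be sketched as follows; the sigmoid parameters and transition scores are placeholders that would normally be fit on held-out data.

        import numpy as np

        def margin_to_posterior(margin, a=-1.0, b=0.0):
            """Platt-style sigmoidal mapping from an SVM output margin."""
            return 1.0 / (1.0 + np.exp(a * margin + b))

        def viterbi(log_emit, log_trans):
            """Best state path. log_emit: (T, S) viseme log-posteriors per frame;
            log_trans: (S, S) log transition scores between visemes."""
            T, S = log_emit.shape
            delta = log_emit[0].copy()
            back = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_trans
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_emit[t]
            path = [int(delta.argmax())]
            for t in range(T - 1, 0, -1):
                path.append(int(back[t, path[-1]]))
            return path[::-1]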

  13. Impact of noise and other factors on speech recognition in anaesthesia

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: Speech recognition is currently being deployed in medical and anaesthesia applications. This article is part of a project to investigate and further develop a prototype of a speech-input interface in Danish for an electronic anaesthesia patient record, to be used in real time during...... operations. Objective: The aim of the experiment is to evaluate the relative impact of several factors affecting speech recognition when used in operating rooms, such as the type or loudness of background noises, type of microphone, type of recognition mode (free speech versus command mode), and type of...... predominant effect, recognition rates for common noises (e.g. ventilation, alarms) are only slightly below rates obtained in a quiet environment. Finally, a redundant architecture succeeds in improving the reliability of the recognitions. Conclusion: This study removes some uncertainties regarding the...

  14. Recognition of speaker-dependent continuous speech with KEAL

    Science.gov (United States)

    Mercier, G.; Bigorgne, D.; Miclet, L.; Le Guennec, L.; Querre, M.

    1989-04-01

    A description of the speaker-dependent continuous speech recognition system KEAL is given. An unknown utterance is recognized by means of the following procedures: acoustic analysis, phonetic segmentation and identification, word and sentence analysis. The combination of feature-based, speaker-independent coarse phonetic segmentation with speaker-dependent statistical classification techniques is one of the main design features of the acoustic-phonetic decoder. The lexical access component is essentially based on a statistical dynamic programming technique which aims at matching a phonemic lexical entry, containing various phonological forms, against a phonetic lattice. Sentence recognition is achieved by use of a context-free grammar and a parsing algorithm derived from Earley's parser. A speaker adaptation module allows some of the system parameters to be adjusted by matching known utterances with their acoustical representation. The task to be performed, described by its vocabulary and its grammar, is given as a parameter of the system. Continuously spoken sentences extracted from a 'pseudo-Logo' language are analyzed and results are presented.

  15. Evaluation of automatic face recognition for automatic border control on actual data recorded of travellers at Schiphol Airport

    NARCIS (Netherlands)

    Spreeuwers, L.J.; Hendrikse, A.J.; Gerritsen, K.J.; Brömme, A.; Busch, C.

    2012-01-01

    Automatic border control at airports using automated facial recognition for checking the passport is becoming more and more common. A problem is that it is not clear how reliable these automatic gates are. Very few independent studies exist that assess the reliability of automated facial recognition

  16. Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling

    OpenAIRE

    Wand, Michael

    2014-01-01

    Speech is the natural medium of human communication, but audible speech can be overheard by bystanders and excludes speech-disabled people. This work presents a speech recognizer based on surface electromyography, where electric potentials of the facial muscles are captured by surface electrodes, allowing speech to be processed nonacoustically. A system which was state-of-the-art at the beginning of this thesis is substantially improved in terms of accuracy, flexibility, and robustness.

  17. Speech recognition interference by the temporal and spectral properties of a single competing talker.

    Science.gov (United States)

    Fogerty, Daniel; Xu, Jiaqian

    2016-08-01

    This study investigated how speech recognition during speech-on-speech masking may be impaired due to the interaction between amplitude modulations of the target and competing talker. Young normal-hearing adults were tested in a competing talker paradigm where the target and/or competing talker was processed to primarily preserve amplitude modulation cues. Effects of talker sex and linguistic interference were also examined. Results suggest that performance patterns for natural speech-on-speech conditions are largely consistent with the same masking patterns observed for signals primarily limited to temporal amplitude modulations. However, results also suggest a role for spectral cues in talker segregation and linguistic competition. PMID:27586780

  18. A Weighted Discrete KNN Method for Mandarin Speech and Emotion Recognition

    OpenAIRE

    Pao, Tsang-Long; Liao, Wen-Yuan; Chen, Yu-Te

    2008-01-01

    In this chapter, we present a speech emotion recognition system and compare several classifiers on clean and noisy speech. Our proposed WD-KNN classifier outperforms the other three KNN-based classifiers at every SNR level and achieves the highest accuracy from clean speech down to 20 dB noisy speech when compared with all other classifiers. Similar to Neiberg et al. (2006), GMM is a feasible technique for emotion classification on the frame level, and the results of GMM are better than perfor...
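
    A weighted discrete KNN decision rule of the kind compared here can be written in a few lines; the linear rank-decayed weights below are one plausible choice and may differ from the weighting used in the chapter.

        import numpy as np

        def wd_knn_classify(x, train_X, train_y, k=10):
            """Each of the k nearest neighbours votes for its class with a
            weight that decays with its distance rank."""
            dists = np.linalg.norm(train_X - x, axis=1)
            order = np.argsort(dists)[:k]
            votes = {}
            for rank, idx in enumerate(order):
                label = train_y[idx]
                votes[label] = votes.get(label, 0) + (k - rank)
            return max(votes, key=votes.get)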

  19. Significance of parametric spectral ratio methods in detection and recognition of whispered speech

    Science.gov (United States)

    Mathur, Arpit; Reddy, Shankar M.; Hegde, Rajesh M.

    2012-12-01

    In this article, the significance of a new parametric spectral ratio method that can be used to detect whispered speech segments within normally phonated speech is described. Adaptation methods based on maximum likelihood linear regression (MLLR) are then used to realize a mismatched train-test style speech recognition system. The proposed parametric spectral ratio method computes a ratio spectrum of the linear prediction (LP) and the minimum variance distortionless response (MVDR) methods. The smoothed ratio spectrum is then used to detect whispered segments of speech within neutral speech segments effectively. The proposed LP-MVDR ratio method exhibits robustness at different SNRs, as indicated by the whisper diarization experiments conducted on the CHAINS and the cell phone whispered speech corpora. The proposed method also performs reasonably better than conventional methods for whisper detection. In order to integrate the proposed whisper detection method into a conventional speech recognition engine with minimal changes, adaptation methods based on the MLLR are used herein. The hidden Markov models corresponding to neutral-mode speech are adapted to the whispered-mode speech data in the whispered regions detected by the proposed ratio method. The performance of this method is first evaluated on whispered speech data from the CHAINS corpus. The second set of experiments is conducted on the cell phone corpus of whispered speech, collected using a setup that is used commercially for handling public transactions. The proposed whisper speech recognition system exhibits reasonably better performance when compared to several conventional methods. The results indicate the possibility of a whispered speech recognition system for cell phone based transactions.
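
    The core quantity, the ratio of the LP spectrum to the MVDR spectrum of a frame, can be computed directly from the frame's autocorrelation. This sketch uses a plain matrix inversion for the MVDR quadratic form and makes no attempt to reproduce the paper's smoothing or thresholds.

        import numpy as np
        from scipy.linalg import toeplitz, solve

        def lp_mvdr_ratio(frame, order=12, nfft=512):
            """Return the LP/MVDR ratio spectrum of one speech frame."""
            r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
            w = solve(toeplitz(r[:order]), r[1:order + 1])   # LP predictor
            a = np.concatenate(([1.0], -w))                  # error filter A(z)
            sigma2 = float(np.dot(a, r[:order + 1]))         # prediction error power
            lp_spec = sigma2 / np.abs(np.fft.rfft(a, nfft)) ** 2

            Rinv = np.linalg.inv(toeplitz(r[:order + 1]))
            k = np.arange(order + 1)
            mvdr = np.empty_like(lp_spec)
            for i, omega in enumerate(np.arange(nfft // 2 + 1) * 2 * np.pi / nfft):
                e = np.exp(-1j * omega * k)                  # steering vector
                mvdr[i] = 1.0 / np.real(np.conj(e) @ Rinv @ e)
            return lp_spec / mvdr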

  20. Error Rates in Users of Automatic Face Recognition Software.

    Directory of Open Access Journals (Sweden)

    David White

    Full Text Available In recent years, wide deployment of automatic face recognition systems has been accompanied by substantial gains in algorithm performance. However, benchmarking tests designed to evaluate these systems do not account for the errors of human operators, who are often an integral part of face recognition solutions in forensic and security settings. This causes a mismatch between evaluation tests and operational accuracy. We address this by measuring user performance in a face recognition system used to screen passport applications for identity fraud. Experiment 1 measured target detection accuracy in algorithm-generated 'candidate lists' selected from a large database of passport images. Accuracy was notably poorer than in previous studies of unfamiliar face matching: participants made over 50% errors for adult target faces, and over 60% when matching images of children. Experiment 2 then compared the performance of student participants to that of trained passport officers, who use the system in their daily work, and found equivalent performance in these groups. Encouragingly, a group of highly trained and experienced "facial examiners" outperformed these groups by 20 percentage points. We conclude that human performance curtails the accuracy of face recognition systems, potentially reducing benchmark estimates by 50% in operational settings. Mere practice does not attenuate these limits, but the superior performance of trained examiners suggests that recruitment and selection of human operators, in combination with effective training and mentorship, can improve the operational accuracy of face recognition systems.

  1. Automatic integration of social information in emotion recognition.

    Science.gov (United States)

    Mumenthaler, Christian; Sander, David

    2015-04-01

    This study investigated the automaticity of the influence of social inference on emotion recognition. Participants were asked to recognize dynamic facial expressions of emotion (fear or anger in Experiment 1 and blends of fear and surprise or of anger and disgust in Experiment 2) in a target face presented at the center of a screen while a subliminal contextual face appearing in the periphery expressed an emotion (fear or anger) or not (neutral) and either looked at the target face or not. Results of Experiment 1 revealed that recognition of the target emotion of fear was improved when a subliminal angry contextual face gazed toward, rather than away from, the fearful face. We replicated this effect in Experiment 2, in which facial expression blends of fear and surprise were more often and more rapidly categorized as expressing fear when the subliminal contextual face expressed anger and gazed toward, rather than away from, the target face. With the contextual face appearing for 30 ms in total, including only 10 ms of emotion expression, and being immediately masked, our data provide the first evidence that social influence on emotion recognition can occur automatically. PMID:25688908

  2. A Comparative Study: Gammachirp Wavelets and Auditory Filter Using Prosodic Features of Speech Recognition In Noisy Environment

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-04-01

    Full Text Available Modern automatic speech recognition (ASR) systems typically use a bank of linear filters as the first step in performing frequency analysis of speech. On the other hand, the cochlea, which is responsible for frequency analysis in the human auditory system, is known to have a compressive non-linear frequency response which depends on input stimulus level. This paper presents a new method based on the gammachirp auditory filter implemented through continuous wavelet analysis. The essential characteristic of this model is that it performs the analysis with a wavelet packet transform over frequency bands that approximate the critical bands of the ear, which differs from the existing model based on short-term Fourier transform (STFT) analysis. Prosodic features such as pitch, formant frequency, jitter and shimmer are extracted from the fundamental frequency contour and added to baseline spectral features, specifically Mel Frequency Cepstral Coefficients (MFCC) for human speech, Gammachirp Filterbank Cepstral Coefficients (GFCC), and Gammachirp Wavelet Frequency Cepstral Coefficients (GWFCC). The results show that the gammachirp wavelet gives results comparable to those obtained with MFCC and GFCC. Experimental results show the best performance of this architecture. This paper implements the GW and examines its application to a specific example of speech. Implications for noise-robust speech analysis are also discussed within the AURORA databases.
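
    A baseline feature stack of the sort described, MFCCs plus simple prosodic measures, can be assembled with librosa. The F0 tracker, the jitter definition, and all parameter values below are illustrative assumptions; the GFCC/GWFCC front ends would replace the mel filter bank rather than follow this recipe.

        import numpy as np
        import librosa

        y, sr = librosa.load(librosa.example('trumpet'))   # stand-in clip (downloads once)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13) # (13, n_frames)

        # Prosodic add-ons: F0 via YIN, jitter as mean relative F0 change.
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)
        jitter = np.mean(np.abs(np.diff(f0))) / np.mean(f0)
        features = np.concatenate([mfcc.mean(axis=1), [f0.mean(), jitter]])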

  3. I Hear You Eat and Speak: Automatic Recognition of Eating Condition and Food Type, Use-Cases, and Impact on ASR Performance

    Science.gov (United States)

    Hantke, Simone; Weninger, Felix; Kurle, Richard; Ringeval, Fabien; Batliner, Anton; Mousa, Amr El-Desoky; Schuller, Björn

    2016-01-01

    We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i. e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard deviation: 2.66 years, gender balanced, German speakers), six types of food (Apple, Nectarine, Banana, Haribo Smurfs, Biscuit, and Crisps), and read as well as spontaneous speech, which is made publicly available for research purposes. We start with demonstrating that for automatic speech recognition (ASR), it pays off to know whether speakers are eating or not. We also propose automatic classification both by brute-forcing of low-level acoustic features as well as higher-level features related to intelligibility, obtained from an Automatic Speech Recogniser. Prediction of the eating condition was performed with a Support Vector Machine (SVM) classifier employed in a leave-one-speaker-out evaluation framework. Results show that the binary prediction of eating condition (i. e., eating or not eating) can be easily solved independently of the speaking condition; the obtained average recalls are all above 90%. Low-level acoustic features provide the best performance on spontaneous speech, which reaches up to 62.3% average recall for multi-way classification of the eating condition, i. e., discriminating the six types of food, as well as not eating. The early fusion of features related to intelligibility with the brute-forced acoustic feature set improves the performance on read speech, reaching a 66.4% average recall for the multi-way classification task. Analysing features and classifier errors leads to a suitable ordinal scale for eating conditions, on which automatic regression can be performed with up to 56.2% determination coefficient. PMID:27176486

  5. Visual abilities are important for auditory-only speech recognition: evidence from autism spectrum disorder.

    Science.gov (United States)

    Schelinski, Stefanie; Riedel, Philipp; von Kriegstein, Katharina

    2014-12-01

    In auditory-only conditions, for example when we listen to someone on the phone, it is essential to fast and accurately recognize what is said (speech recognition). Previous studies have shown that speech recognition performance in auditory-only conditions is better if the speaker is known not only by voice, but also by face. Here, we tested the hypothesis that such an improvement in auditory-only speech recognition depends on the ability to lip-read. To test this we recruited a group of adults with autism spectrum disorder (ASD), a condition associated with difficulties in lip-reading, and typically developed controls. All participants were trained to identify six speakers by name and voice. Three speakers were learned by a video showing their face and three others were learned in a matched control condition without face. After training, participants performed an auditory-only speech recognition test that consisted of sentences spoken by the trained speakers. As a control condition, the test also included speaker identity recognition on the same auditory material. The results showed that, in the control group, performance in speech recognition was improved for speakers known by face in comparison to speakers learned in the matched control condition without face. The ASD group lacked such a performance benefit. For the ASD group auditory-only speech recognition was even worse for speakers known by face compared to speakers not known by face. In speaker identity recognition, the ASD group performed worse than the control group independent of whether the speakers were learned with or without face. Two additional visual experiments showed that the ASD group performed worse in lip-reading whereas face identity recognition was within the normal range. The findings support the view that auditory-only communication involves specific visual mechanisms. Further, they indicate that in ASD, speaker-specific dynamic visual information is not available to optimize auditory

  6. A speech recognition system for data collection in precision agriculture

    Science.gov (United States)

    Dux, David Lee

    Agricultural producers have shown interest in collecting detailed, accurate, and meaningful field data through field scouting, but scouting is labor intensive. They use yield monitor attachments to collect weed and other field data while driving equipment. However, distractions from using a keyboard or buttons while driving can lead to driving errors or missed data points. At Purdue University, researchers have developed an ASR system to allow equipment operators to collect georeferenced data while keeping hands and eyes on the machine during harvesting and to ease georeferencing of data collected during scouting. A notebook computer retrieved locations from a GPS unit and displayed and stored data in Excel. A headset microphone with a single earphone collected spoken input while allowing the operator to hear outside sounds. One-, two-, or three-word commands activated appropriate VBA macros. Four speech recognition products were chosen based on hardware requirements and ability to add new terms. After training, speech recognition accuracy was 100% for Kurzweil VoicePlus and Verbex Listen for the 132 vocabulary words tested, during tests walking outdoors or driving an ATV. Scouting tests were performed by carrying the system in a backpack while walking in soybean fields. The system recorded a point or a series of points with each utterance. Boundaries of points showed problem areas in the field and single points marked rocks and field corners. Data were displayed as an Excel chart to show a real-time map as data were collected. The information was later displayed in a GIS over remote sensed field images. Field corners and areas of poor stand matched, with voice data explaining anomalies in the image. The system was tested during soybean harvest by using voice to locate weed patches. A harvester operator with little computer experience marked points by voice when the harvester entered and exited weed patches or areas with poor crop stand. The operator found the

  7. Automatic voice recognition using traditional and artificial neural network approaches

    Science.gov (United States)

    Botros, Nazeih M.

    1989-01-01

    The main objective of this research is to develop an algorithm for isolated-word recognition. This research is focused on digital signal analysis rather than linguistic analysis of speech. Feature extraction is carried out by applying a Linear Predictive Coding (LPC) algorithm of order 10. Continuous-word and speaker-independent recognition will be considered in a future study after this isolated-word research is accomplished. To examine the similarity between the reference and the training sets, two approaches are explored. The first implements traditional pattern recognition techniques, in which a dynamic time warping algorithm is applied to align the two sets and the probability of matching is calculated by measuring the Euclidean distance between them. The second implements a backpropagation artificial neural net model with three layers as the pattern classifier. The adaptation rule implemented in this network is the generalized least mean square (LMS) rule. The first approach has been accomplished. A vocabulary of 50 words was selected and tested. The accuracy of the algorithm was found to be around 85 percent. The second approach is in progress at the present time.
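
    The first approach, dynamic time warping over LPC feature sequences with a Euclidean local distance, can be sketched as follows; the template dictionary and feature shapes are illustrative.

        import numpy as np

        def dtw_distance(ref, test):
            """DTW distance between two (n_frames, n_coeffs) feature sequences,
            e.g. order-10 LPC vectors, with the standard three-way recursion."""
            n, m = len(ref), len(test)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(ref[i - 1] - test[j - 1])  # Euclidean
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        def recognize(test, templates):
            """Pick the vocabulary word whose template warps most cheaply."""
            return min(templates, key=lambda word: dtw_distance(templates[word], test))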

  8. A pattern recognition approach based on DTW for automatic transient identification in nuclear power plants

    International Nuclear Information System (INIS)

    Highlights: novel transient identification method for NPPs; low complexity; low training data requirements; high accuracy; fully reproducible protocol carried out on a real benchmark. Abstract: Automatic identification of transients in nuclear power plants (NPPs) allows monitoring the fatigue damage accumulated by critical components during plant operation, and is therefore of great importance for ensuring that usage factors remain within the original design bases postulated by the plant designer. Although several schemes to address this important issue have been explored in the literature, there is still no definitive solution available. In the present work, a new method for automatic transient identification is proposed, based on the Dynamic Time Warping (DTW) algorithm, largely used in other related areas such as signature or speech recognition. The novel transient identification system is evaluated on real operational data following a rigorous pattern recognition protocol. Results show the high accuracy of the proposed approach, which is combined with other interesting features such as its low complexity and its very limited requirements for training data.

  9. Introduction and Overview of the Vicens-Reddy Speech Recognition System.

    Science.gov (United States)

    Kameny, Iris; Ritea, H.

    The Vicens-Reddy System is unique in the sense that it approaches the problem of speech recognition as a whole, rather than treating particular aspects of the problem as in previous attempts. For example, where earlier systems treated only the segmentation of speech into phoneme groups, or detected phonemes in a given context, the Vicens-Reddy System…

  10. CCD camera automatic calibration technology and ellipse recognition algorithm

    Institute of Scientific and Technical Information of China (English)

    Changku Sun; Xiaodong Zhang; Yunxia Qu

    2005-01-01

    A novel two-dimensional (2D) pattern used in camera calibration is presented. With one feature circle located at the center, an array of circles is photo-etched on this pattern. An ellipse recognition algorithm is proposed to implement the acquisition of interest calibration points without human intervention. According to the circle arrangement of the pattern, the relation between three-dimensional (3D) and 2D coordinates of these points can be established automatically and accurately. These calibration points are computed for intrinsic parameters calibration of charge-coupled device (CCD) camera with Tsai method. A series of experiments have shown that the algorithm is robust and reliable with the calibration error less than 0.4 pixel. This new calibration pattern and ellipse recognition algorithm can be widely used in computer vision.

  11. Immediate and sustained benefits of a “total” implementation of speech recognition reporting

    OpenAIRE

    Hart, J. L.; McBride, A; Blunt, D.; Gishen, P; STRICKLAND, N.

    2010-01-01

    Speech recognition reporting was introduced in our institution to address the significant delay between report dictation and the appearance of a typed report on the Picture Archiving and Communication System (PACS). We report our experience of a “total” implementation of a speech recognition reporting (SRR) system, which became the sole means of radiology reporting from day 1 of introduction. Prospectively gathered Radiology Information System (RIS) data were examined to determine the monthly...

  12. Influence of GSM speech coding on the performance of text-independent speaker recognition

    OpenAIRE

    Grassi, Sara; Besacier, Laurent; DUFAUX, Alain; Ansorge, Michael; Pellandini, Fausto

    2006-01-01

    We have investigated the influence of GSM speech coding on the performance of a text-independent speaker recognition system based on Gaussian Mixture Models (GMM). The performance degradation due to the use of the three GSM speech coders was assessed using three transcoded databases, obtained by passing the TIMIT database through each GSM coder/decoder. The recognition performance was also assessed using the original TIMIT database and its 8 kHz downsampled version. Then, different experiments were ca...
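
    A text-independent GMM recognizer of the kind degraded here is straightforward with scikit-learn; feature extraction from TIMIT (or its GSM-transcoded versions) is assumed to have happened elsewhere, and the mixture size is illustrative.

        from sklearn.mixture import GaussianMixture

        def train_speaker_models(frames_by_speaker, n_components=32):
            """Fit one diagonal-covariance GMM per speaker on that speaker's
            feature frames (rows = frames, e.g. MFCC vectors)."""
            return {spk: GaussianMixture(n_components=n_components,
                                         covariance_type='diag').fit(frames)
                    for spk, frames in frames_by_speaker.items()}

        def identify(models, test_frames):
            """Sum per-frame log-likelihoods and return the best speaker."""
            return max(models, key=lambda s: models[s].score_samples(test_frames).sum())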

  13. A Cognitive Science Reasoning in Recognition of Emotions in Audio-Visual Speech

    OpenAIRE

    Slavova, Velina; Verhelst, Werner; Sahli, Hichem

    2008-01-01

    In this report we summarize the state of the art of speech emotion recognition from the signal processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, we observe that existing approaches for supervised machine learning lead to database-dependent classifiers which cannot be applied to multi-language speech emotion recognition without additional training, because they discriminate the emotion classes following the use...

  14. Automatic radar target recognition of objects falling on railway tracks

    International Nuclear Information System (INIS)

    This paper presents an automatic radar target recognition procedure based on complex resonances using the signals provided by ultra-wideband radar. This procedure is dedicated to detection and identification of objects lying on railway tracks. For an efficient complex resonance extraction, a comparison between several pole extraction methods is illustrated. Therefore, preprocessing methods are presented aiming to remove most of the erroneous poles interfering with the discrimination scheme. Once physical poles are determined, a specific discrimination technique is introduced based on the Euclidean distances. Both simulation and experimental results are depicted showing an efficient discrimination of different targets including guided transport passengers

  15. Influence of native and non-native multitalker babble on speech recognition in noise

    Directory of Open Access Journals (Sweden)

    Chandni Jain

    2014-03-01

    Full Text Available The aim of the study was to assess speech recognition in noise using multitalker babble in native and non-native languages at two different signal-to-noise ratios. Speech recognition in noise was assessed in 60 participants (18 to 30 years) with normal hearing sensitivity, having Malayalam or Kannada as their native language. For this purpose, six- and ten-talker babble were generated in the Kannada and Malayalam languages. Speech recognition was assessed for native listeners of both languages in the presence of native and non-native multitalker babble. Results showed that speech recognition in noise was significantly higher at the 0 dB signal-to-noise ratio (SNR) than at the -3 dB SNR for both languages. The performance of Kannada listeners was significantly higher in the presence of native (Kannada) babble compared to non-native (Malayalam) babble. However, this was not the same for the Malayalam listeners, who performed equally well with native (Malayalam) and non-native (Kannada) babble. The results of the present study highlight the importance of using native multitalker babble for Kannada listeners in lieu of non-native babble, and of considering the importance of each SNR when estimating speech recognition in noise scores. Further research is needed to assess speech recognition in Malayalam listeners in the presence of other non-native backgrounds of various types.

  16. Is Listening in Noise Worth It? The Neurobiology of Speech Recognition in Challenging Listening Conditions.

    Science.gov (United States)

    Eckert, Mark A; Teubner-Rhodes, Susan; Vaden, Kenneth I

    2016-01-01

    This review examines findings from functional neuroimaging studies of speech recognition in noise to provide a neural systems level explanation for the effort and fatigue that can be experienced during speech recognition in challenging listening conditions. Neuroimaging studies of speech recognition consistently demonstrate that challenging listening conditions engage neural systems that are used to monitor and optimize performance across a wide range of tasks. These systems appear to improve speech recognition in younger and older adults, but sustained engagement of these systems also appears to produce an experience of effort and fatigue that may affect the value of communication. When considered in the broader context of the neuroimaging and decision making literature, the speech recognition findings from functional imaging studies indicate that the expected value, or expected level of speech recognition given the difficulty of listening conditions, should be considered when measuring effort and fatigue. The authors propose that the behavioral economics or neuroeconomics of listening can provide a conceptual and experimental framework for understanding effort and fatigue that may have clinical significance. PMID:27355759

  17. Influences of Infant-Directed Speech on Early Word Recognition

    Science.gov (United States)

    Singh, Leher; Nestor, Sarah; Parikh, Chandni; Yull, Ashley

    2009-01-01

    When addressing infants, many adults adopt a particular type of speech, known as infant-directed speech (IDS). IDS is characterized by exaggerated intonation, as well as reduced speech rate, shorter utterance duration, and grammatical simplification. It is commonly asserted that IDS serves in part to facilitate language learning. Although…

  18. Adoption of Speech Recognition Technology in Community Healthcare Nursing.

    Science.gov (United States)

    Al-Masslawi, Dawood; Block, Lori; Ronquillo, Charlene

    2016-01-01

    Adoption of new health information technology is known to be challenging. However, the degree to which new technology will be adopted can be predicted by measures of usefulness and ease of use. In this work, these key determining factors inform the design of a wound documentation tool. In the context of wound care at home, and consistent with evidence in the literature from similar settings, the use of Speech Recognition Technology (SRT) for patient documentation has shown promise. To achieve a user-centred design, the results of ethnographic fieldwork are used to inform SRT features; furthermore, exploratory prototyping is used to collect feedback about the wound documentation tool from home care nurses. During this study, measures developed for healthcare applications of the Technology Acceptance Model will be used to identify SRT features that improve usefulness (e.g. increased accuracy, saving time) or ease of use (e.g. lowering mental/physical effort, easy-to-remember tasks). The identified features will be used to create a low-fidelity prototype that will be evaluated in future experiments. PMID:27332294

  19. An analytical approach to photonic reservoir computing - a network of SOA's - for noisy speech recognition

    Science.gov (United States)

    Salehi, Mohammad Reza; Abiri, Ebrahim; Dehyadegari, Louiza

    2013-10-01

    This paper investigates a photonic reservoir computing approach for optical speech recognition, examined on an isolated digit recognition task. An analytical approach to photonic reservoir computing is used to decrease time consumption compared to numerical methods, which is very important when processing large signals such as speech. It is also observed that adjusting the reservoir parameters, along with a good nonlinear mapping of the input signal into the reservoir, boosts recognition accuracy under the analytical approach. Perfect recognition accuracy (i.e. 100%) can be achieved for noiseless speech signals. For noisy signals with signal-to-noise ratios of 0-10 dB, however, the accuracy varied between 92% and 98%. In fact, the photonic reservoir demonstrated a 9-18% improvement compared to classical reservoir networks with hyperbolic tangent nodes.
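
    The software analogue of such a reservoir is an echo state network: a fixed random recurrent layer whose states feed a trained linear readout. In the photonic version the tanh nodes are replaced by semiconductor optical amplifiers; the sketch below only illustrates the reservoir-computing principle, with all sizes and scalings chosen arbitrarily.

        import numpy as np

        class EchoStateNetwork:
            def __init__(self, n_in, n_res, spectral_radius=0.9, seed=0):
                rng = np.random.default_rng(seed)
                self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
                W = rng.normal(0.0, 1.0, (n_res, n_res))
                # Rescale so the reservoir has the echo-state property.
                self.W = W * (spectral_radius / np.max(np.abs(np.linalg.eigvals(W))))

            def states(self, inputs):
                """Run the (fixed, untrained) reservoir over an input sequence."""
                x = np.zeros(self.W.shape[0])
                collected = []
                for u in inputs:
                    x = np.tanh(self.W_in @ u + self.W @ x)
                    collected.append(x.copy())
                return np.array(collected)

        def train_readout(states, targets, ridge=1e-6):
            """Only the linear readout is trained, here by ridge regression."""
            return np.linalg.solve(states.T @ states + ridge * np.eye(states.shape[1]),
                                   states.T @ targets)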

  20. Entrance C - New Automatic Number Plate Recognition System

    CERN Multimedia

    2013-01-01

    Entrance C (Satigny) is now equipped with a latest-generation Automatic Number Plate Recognition (ANPR) system and a fast-action road gate. During the month of August, Entrance C will be continuously open from 7.00 a.m. to 7.00 p.m. (working days only). The security guards will open the gate as usual from 7.00 a.m. to 9.00 a.m. and from 5.00 p.m. to 7.00 p.m. For the rest of the working day (9.00 a.m. to 5.00 p.m.) the gate will operate automatically. Please observe the following points: stop at the STOP sign on the ground; position yourself next to the card reader for optimal recognition; motorcyclists must use their CERN card; cyclists may not activate the gate and should use the bicycle turnstile; keep a safe distance from the vehicle in front of you. If access is denied, please check that your vehicle regist...

  1. Acceptance of speech recognition by physicians: A survey of expectations, experiences, and social influence

    DEFF Research Database (Denmark)

    Alapetite, Alexandre; Andersen, Henning Boje; Hertzum, Morten

    2009-01-01

    The present study has surveyed physician views and attitudes before and after the introduction of speech technology as a front end to an electronic medical record. At the hospital where the survey was made, speech technology recently (2006–2007) replaced traditional dictation and subsequent secretarial transcription for all physicians in clinical departments. The aim of the survey was (i) to identify how attitudes and perceptions among physicians affected the acceptance and success of the speech-recognition system and the new work procedures associated with it; and (ii) to assess the degree to which physicians' attitudes and expectations to the use of speech technology changed after actually using it. The survey was based on two questionnaires: one administered when the physicians were about to begin training with the speech-recognition system and another, asking similar questions, when they...

  2. Statistical Language Modeling for Automatic Speech Recognition of Agglutinative Languages

    OpenAIRE

    Arısoy, Ebru; Kurimo, Mikko; Saraçlar, Murat; Hirsimäki, Teemu; Pylkkönen, Janne; Alumäe, Tanel; Sak, Haşim

    2008-01-01

    This work presents statistical language models trained on different agglutinative languages, utilizing a lexicon based on the recently proposed unsupervised statistical morphs. The significance of this work is that similarly generated sub-word unit lexica are developed and successfully evaluated in three different LVCSR systems in different languages. In each case the morph-based approach is at least as good as or better than a very large vocabulary word-based LVCSR language model. Even though usi...

  3. Use of intonation contours for speech recognition in noise by cochlear implant recipients.

    Science.gov (United States)

    Meister, Hartmut; Landwehr, Markus; Pyschny, Verena; Grugel, Linda; Walger, Martin

    2011-05-01

    The corruption of intonation contours has detrimental effects on sentence-based speech recognition in normal-hearing listeners [Binns and Culling (2007). J. Acoust. Soc. Am. 122, 1765-1776]. This paper examines whether this finding also applies to cochlear implant (CI) recipients. The subjects' F0 discrimination and speech perception in the presence of noise were measured, using sentences with regular and inverted F0 contours. The results revealed that speech recognition for regular contours was significantly better than for inverted contours. This difference was related to the subjects' F0 discrimination, providing further evidence that the perception of intonation patterns is important for CI-mediated speech recognition in noise. PMID:21568376

  4. Combining Semantic and Acoustic Features for Valence and Arousal Recognition in Speech

    DEFF Research Database (Denmark)

    Karadogan, Seliz; Larsen, Jan

    2012-01-01

    The recognition of affect in speech has attracted a lot of interest recently, especially in the areas of cognitive and computer science. Most of the previous studies focused on the recognition of basic emotions (such as happiness, sadness and anger) using a categorical approach. Recently, the focus...

  5. The Effect of Asymmetrical Signal Degradation on Binaural Speech Recognition in Children and Adults.

    Science.gov (United States)

    Rothpletz, Ann M.; Tharpe, Anne Marie; Grantham, D. Wesley

    2004-01-01

    To determine the effect of asymmetrical signal degradation on binaural speech recognition, 28 children and 14 adults were administered a sentence recognition task amidst multitalker babble. There were 3 listening conditions: (a) monaural, with mild degradation in 1 ear; (b) binaural, with mild degradation in both ears (symmetric degradation); and…

  6. Effects of Semantic Context and Fundamental Frequency Contours on Mandarin Speech Recognition by Second Language Learners

    Science.gov (United States)

    Zhang, Linjun; Li, Yu; Wu, Han; Li, Xin; Shu, Hua; Zhang, Yang; Li, Ping

    2016-01-01

    Speech recognition by second language (L2) learners in optimal and suboptimal conditions has been examined extensively with English as the target language in most previous studies. This study extended existing experimental protocols (Wang et al., 2013) to investigate Mandarin speech recognition by Japanese learners of Mandarin at two different levels (elementary vs. intermediate) of proficiency. The overall results showed that in addition to L2 proficiency, semantic context, F0 contours, and listening condition all affected the recognition performance on the Mandarin sentences. However, the effects of semantic context and F0 contours on L2 speech recognition diverged to some extent. Specifically, there was significant modulation effect of listening condition on semantic context, indicating that L2 learners made use of semantic context less efficiently in the interfering background than in quiet. In contrast, no significant modulation effect of listening condition on F0 contours was found. Furthermore, there was significant interaction between semantic context and F0 contours, indicating that semantic context becomes more important for L2 speech recognition when F0 information is degraded. None of these effects were found to be modulated by L2 proficiency. The discrepancy in the effects of semantic context and F0 contours on L2 speech recognition in the interfering background might be related to differences in processing capacities required by the two types of information in adverse listening conditions. PMID:27378997

  7. Speech recognition materials and ceiling effects: considerations for cochlear implant programs.

    Science.gov (United States)

    Gifford, René H; Shallop, Jon K; Peterson, Anna Mary

    2008-01-01

    Cochlear implant recipients have demonstrated remarkable increases in speech perception since US FDA approval was granted in 1984. Improved performance is due to a number of factors including improved cochlear implant technology, evolving speech coding strategies, and individuals with increasingly more residual hearing receiving implants. Despite this evolution, the same recommendations for pre- and postimplant speech recognition testing have been in place for over 10 years in the United States. To determine whether new recommendations are warranted, speech perception performance was assessed for 156 adult, postlingually deafened implant recipients as well as 50 hearing aid users on monosyllabic word recognition (CNC) and sentence recognition in quiet (HINT and AzBio sentences) and in noise (BKB-SIN). Results demonstrated that for HINT sentences in quiet, 28% of the subjects tested achieved maximum performance of 100% correct and that scores did not agree well with monosyllables (CNC) or sentence recognition in noise (BKB-SIN). For a more difficult sentence recognition material (AzBio), only 0.7% of the subjects achieved 100% performance and scores were in much better agreement with monosyllables and sentence recognition in noise. These results suggest that more difficult materials are needed to assess speech perception performance of postimplant patients - and perhaps also for determining implant candidacy. PMID:18212519

  8. Noisy Speech Recognition Based on Integration/Selection of Multiple Noise Suppression Methods Using Noise GMMs

    Science.gov (United States)

    Kitaoka, Norihide; Hamaguchi, Souta; Nakagawa, Seiichi

    To achieve high recognition performance for a wide variety of noises and for a wide range of signal-to-noise ratios, this paper presents methods for the integration of four noise reduction algorithms: spectral subtraction with smoothing along the time direction, temporal-domain SVD-based speech enhancement, GMM-based speech estimation, and KLT-based comb filtering. We propose two types of combination methods: selection of the front-end processor, and combination of the results from multiple recognition processes. Recognition results on the CENSREC-1 task showed the effectiveness of our proposed methods.
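
    The first of the four front ends, spectral subtraction with smoothing along the time direction, is easy to sketch; the over-subtraction factor, spectral floor, and smoothing constant below are typical textbook values, not the authors' settings.

        import numpy as np

        def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, beta=0.02, smooth=0.8):
            """noisy_mag: (n_frames, n_bins) STFT magnitudes of the noisy speech;
            noise_mag: (n_bins,) magnitude estimate of the noise spectrum."""
            clean = noisy_mag ** 2 - alpha * noise_mag ** 2     # power subtraction
            clean = np.maximum(clean, beta * noise_mag ** 2)    # spectral floor
            clean = np.sqrt(clean)
            for t in range(1, clean.shape[0]):                  # time smoothing
                clean[t] = smooth * clean[t - 1] + (1.0 - smooth) * clean[t]
            return clean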

  9. Report generation using digital speech recognition in radiology.

    Science.gov (United States)

    Vorbeck, F; Ba-Ssalamah, A; Kettenbach, J; Huebsch, P

    2000-01-01

    The aim of this study was to evaluate whether the use of a digital continuous speech recognition (CSR) in the field of radiology could lead to relevant time savings in generating a report. A CSR system (SP6000, Philips, Eindhoven, The Netherlands) for German was used to transform fluently spoken sentences into text. Two radiologists dictated a total of 450 reports on five radiological topics. Two typists edited those reports by means of conventional typing using a text editor (WinWord 6.0, Microsoft, Redmond, Wash.) installed on an IBM-compatible personal computer (PC). The same reports were generated using the CSR system and the performance of both systems was then evaluated by comparing the time needed to generate the reports and the error rates of both systems. In addition, the error rate of the CSR system and the time needed to create the reports was evaluated. The mean error rate for the CSR system was 5.5%, and the mean error rate for conventional typing was 0.4%. Reports edited with the CSR, on average, were generated 19% faster compared with the conventional text-editing method. However, the amount of error rates and time savings were different and depended on topics, speakers, and typists. Using CSR the maximum time saving achieved was 28% for the topic sonography. The CSR system was never slower, under any circumstances, than conventional typing on a PC. When compared with a conventional manual typing method, the CSR system proved to be useful in a clinical setting and saved time in generating radiological reports. The amount of time saved, however, greatly depended on the performance of the typist, the speaker, and on stored vocabulary provided by the CSR system. PMID:11305581

  11. Feature Fusion Algorithm for Multimodal Emotion Recognition from Speech and Facial Expression Signal

    Directory of Open Access Journals (Sweden)

    Han Zhiyan

    2016-01-01

    Full Text Available In order to overcome the limitations of single-mode emotion recognition, this paper describes a novel multimodal emotion recognition algorithm that takes speech signals and facial expression signals as its research subjects. First, the speech-signal features and facial-expression features are fused, sample sets are drawn by sampling with replacement, and classifiers are trained with a BP neural network (BPNN). Second, the difference between two classifiers is measured by a double error difference selection strategy. Finally, the final recognition result is obtained by the majority voting rule. Experiments show that the method improves the accuracy of emotion recognition by exploiting the complementary advantages of decision-level and feature-level fusion, bringing the whole fusion process closer to human emotion recognition, with a recognition rate of 90.4%.
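
    As a rough illustration of the fusion scheme described above, the sketch below trains a small ensemble on concatenated speech and facial features and fuses decisions by majority vote. It is a minimal sketch, not the authors' implementation: scikit-learn's MLPClassifier stands in for the BP neural network, the double error difference selection step is omitted, and class labels are assumed to be small non-negative integers.

        # Minimal sketch: feature-level fusion + bootstrap ensemble + majority vote.
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def train_fusion_ensemble(speech_feats, face_feats, labels, n_classifiers=5, seed=0):
            rng = np.random.default_rng(seed)
            X = np.hstack([speech_feats, face_feats])       # feature-level fusion
            ensemble = []
            for _ in range(n_classifiers):
                idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
                clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
                clf.fit(X[idx], labels[idx])
                ensemble.append(clf)
            return ensemble

        def predict_majority(ensemble, speech_feats, face_feats):
            X = np.hstack([speech_feats, face_feats])
            votes = np.stack([clf.predict(X) for clf in ensemble])  # (n_clf, n_samples)
            # decision-level fusion: majority vote across the classifiers
            return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)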

  12. Objective automatic assessment of rehabilitative speech treatment in Parkinson's disease

    OpenAIRE

    Tsanas, A; Little, M.A.; Fox, C.; Ramig, L O

    2014-01-01

    Vocal performance degradation is a common symptom for the vast majority of Parkinson's disease (PD) subjects, who typically follow personalized one-to-one periodic rehabilitation meetings with speech experts over a long-term period. Recently, a novel computer program called Lee Silverman voice treatment (LSVT) Companion was developed to allow PD subjects to independently progress through a rehabilitative treatment session. This study is part of the assessment of the LSVT Companion, aiming to ...

  13. Automatic recognition of offensive team formation in american football plays

    KAUST Repository

    Atmosukarto, Indriyati

    2013-06-01

    Compared to security surveillance and military applications, where automated action analysis is prevalent, the sports domain is extremely under-served. Most existing software packages for sports video analysis require manual annotation of important events in the video. American football is the most popular sport in the United States; however, most game analysis is still done manually. Line of scrimmage and offensive team formation recognition are two statistics that must be tagged by American football coaches when watching and evaluating past play video clips, a process which takes many man-hours per week. These two statistics are also the building blocks for more high-level analysis such as play strategy inference and automatic statistic generation. In this paper, we propose a novel framework in which, given an American football play clip, we automatically identify the video frame in which the offensive team lines up in formation (the formation frame), the line of scrimmage for that play, and the type of player formation the offensive team takes on. The proposed framework achieves 95% accuracy in detecting the formation frame, 98% accuracy in detecting the line of scrimmage, and up to 67% accuracy in classifying the offensive team's formation. To validate our framework, we compiled a large dataset comprising more than 800 play clips of standard and high-definition resolution from real-world football games. This dataset will be made publicly available for future comparison. © 2013 IEEE.

  14. An Automatic Number Plate Recognition System under Image Processing

    Directory of Open Access Journals (Sweden)

    Sarbjit Kaur

    2016-03-01

    Full Text Available Automatic Number Plate Recognition (ANPR) is an application of computer vision and image processing technology that takes a photograph of a vehicle as the input image, extracts the number plate from the whole vehicle image, and displays the number plate information as text. The ANPR system mainly consists of four phases: acquisition of the vehicle image and pre-processing, extraction of the number plate area, character segmentation, and character recognition. The overall accuracy and efficiency of the whole ANPR system depends on the number plate extraction phase, since the character segmentation and character recognition phases also depend on its output. Further, the accuracy of the number plate extraction phase depends on the quality of the captured vehicle image: the higher the quality of the captured input image, the better the chances of properly extracting the vehicle number plate area. Existing ANPR methods work well for dark and bright/light image categories, but not for low-contrast, blurred, and noisy images, and the detection of the exact number plate area with the existing approach is not successful even after applying existing filtering and enhancement techniques to these types of images. Owing to wrong extraction of the number plate area, character segmentation and character recognition also fail in this case with the existing method. To overcome these drawbacks, an efficient ANPR approach is proposed in which the input vehicle image is first pre-processed by iterative bilateral filtering and adaptive histogram equalization, and the number plate is extracted from the pre-processed vehicle image using morphological operations, image subtraction, image binarization/thresholding, Sobel vertical edge detection, and bounding-box analysis. Sometimes the extracted plate area also contains noise, bolts, frames, etc., so the extracted plate area is enhanced by using morphological operations to improve the quality of
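
    The pipeline in this abstract can be approximated in a few lines of OpenCV. The sketch below is a minimal reading of the described approach, with illustrative parameter values that are not taken from the paper; the image-subtraction step and the final plate-area enhancement are omitted for brevity.

        # Minimal sketch of plate-region extraction from a grayscale vehicle image.
        import cv2

        def extract_plate_candidates(gray):
            # Pre-processing: iterated bilateral filtering plus adaptive histogram
            # equalization, to cope with low-contrast, blurred and noisy images
            for _ in range(3):
                gray = cv2.bilateralFilter(gray, d=9, sigmaColor=75, sigmaSpace=75)
            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
            enhanced = clahe.apply(gray)
            # Sobel vertical edge detection highlights the dense vertical strokes
            # of plate characters; Otsu thresholding binarizes the edge map
            edges = cv2.Sobel(enhanced, cv2.CV_8U, 1, 0, ksize=3)
            _, binary = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            # Morphological closing merges character edges into plate-shaped blobs
            kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 3))
            closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
            # Bounding-box analysis: keep regions with plate-like aspect ratios
            contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
            candidates = []
            for c in contours:
                x, y, w, h = cv2.boundingRect(c)
                if h > 0 and 2.0 < w / h < 6.0:
                    candidates.append((x, y, w, h))
            return candidates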

  15. Using the FASST source separation toolbox for noise robust speech recognition

    OpenAIRE

    Ozerov, Alexey; Vincent, Emmanuel

    2011-01-01

    We describe our submission to the 2011 CHiME Speech Separation and Recognition Challenge. Our speech separation algorithm was built using the Flexible Audio Source Separation Toolbox (FASST) we developed recently. This toolbox is an implementation of a general flexible framework based on a library of structured source models that enable the incorporation of prior knowledge about a source separation problem via user-specifiable constraints. We show how to use FASST to develop an efficient spee...

  16. Voice Activity Detector of Wake-Up-Word Speech Recognition System Design on FPGA

    Directory of Open Access Journals (Sweden)

    Veton Z. Këpuska

    2014-12-01

    Full Text Available A typical speech recognition system is push-to-talk operated and requires manual activation. However, for hands-busy applications such movement may be restricted or impossible. One alternative is a speech-only interface. The proposed method, called Wake-Up-Word Speech Recognition (WUW-SR), utilizes such a speech-only interface: a WUW-SR system allows the user to activate systems (cell phone, computer, etc.) with speech commands alone instead of manual activation. The trend in WUW-SR hardware design is towards implementing a complete system on a single chip intended for various applications. This paper presents an experimental FPGA design and implementation of a novel architecture for a real-time feature extraction processor that includes a Voice Activity Detector (VAD) and feature extraction: MFCC, LPC, and ENH_MFCC. In the WUW-SR system, the recognizer front-end with VAD is located at the terminal, which is typically connected over a data network (e.g., to a server) for remote back-end recognition. The VAD is responsible for segmenting the signal into speech-like and non-speech-like segments. For any given frame, the VAD reports one of two possible states: VAD_ON or VAD_OFF. The back-end is then responsible for scoring the features segmented during the VAD_ON stage. The most important characteristic of the presented design is that it should guarantee virtually 100% correct rejection for non-WUW (out-of-vocabulary, OOV) words while maintaining a correct acceptance rate of 99.9% or higher for in-vocabulary (INV) words. This requirement sets WUW-SR apart from other speech recognition tasks, because no existing system can guarantee 100% reliability by any measure.
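
    To make the VAD_ON/VAD_OFF segmentation concrete, the sketch below labels frames with a simple short-time log-energy threshold. This is only a minimal software illustration of the front-end's role, not the paper's FPGA feature-extraction processor; the frame sizes and the threshold are illustrative assumptions.

        # Minimal frame-level VAD sketch based on short-time log energy.
        import numpy as np

        VAD_ON, VAD_OFF = 1, 0

        def vad_frames(signal, frame_len=400, hop=160, threshold_db=-40.0):
            """Label each frame VAD_ON or VAD_OFF (energy relative to signal peak)."""
            states = []
            peak = np.max(np.abs(signal)) + 1e-12
            for start in range(0, len(signal) - frame_len + 1, hop):
                frame = signal[start:start + frame_len]
                energy_db = 10.0 * np.log10(np.mean(frame ** 2) / peak ** 2 + 1e-12)
                states.append(VAD_ON if energy_db > threshold_db else VAD_OFF)
            return states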

  17. Chinese Speech Recognition Model Based on Activation of the State Feedback Neural Network

    Institute of Scientific and Technical Information of China (English)

    李先志; 孙义和

    2001-01-01

    This paper proposes a simplified novel speech recognition model, the state feedback neural network activation model (SFNNAM), which is developed based on the characteristics of Chinese speech structure. The model assumes that the current state of speech is only a correction of the immediately preceding state. According to the "C-V" (Consonant-Vowel) structure of the Chinese language, a speech segmentation method is also implemented in the SFNNAM model. The model has a definite physical meaning grounded in the structure of the Chinese language and is easily implemented in very large scale integrated circuits (VLSI). In the speech recognition experiment, fewer calculations were needed than in the hidden Markov model (HMM) based algorithm. The recognition rate for Chinese numbers was 93.5% for the first candidate and 99.5% for the first two candidates.

  18. [Research on Barrier-free Home Environment System Based on Speech Recognition].

    Science.gov (United States)

    Zhu, Husheng; Yu, Hongliu; Shi, Ping; Fang, Youfang; Jian, Zhuo

    2015-10-01

    The number of people with physical disabilities is increasing year by year, and the trend of population aging is more and more serious. In order to improve their quality of life, a barrier-free home environment control system for patients with serious disabilities was developed to control home electrical devices with the patient's voice. The control system includes a central control platform, a speech recognition module, a terminal operation module, etc. The system combines speech recognition control technology and wireless information transmission technology with embedded mobile computing technology, and interconnects the lamps, electronic locks, alarms, TV, and other electrical devices in the home environment as a whole system through wireless network nodes. The experimental results showed that the speech recognition success rate was more than 84% in the home environment. PMID:26964305

  19. Hybrid Approach for Language Identification Oriented to Multilingual Speech Recognition in the Basque Context

    Science.gov (United States)

    Barroso, N.; de Ipiña, K. López; Ezeiza, A.; Barroso, O.; Susperregi, U.

    The development of Multilingual Large Vocabulary Continuous Speech Recognition systems involves issues such as Language Identification, Acoustic-Phonetic Decoding, Language Modelling, and the development of appropriate Language Resources. Interest in multilingual systems arises because there are three official languages in the Basque Country (Basque, Spanish, and French) with much linguistic interaction among them, even though Basque has very different roots from the other two languages. This paper describes the development of a Language Identification (LID) system oriented to robust Multilingual Speech Recognition for the Basque context. The work presents hybrid strategies for LID, based on the selection of system elements by Support Vector Machine and Multilayer Perceptron classifiers, and stochastic methods for speech recognition tasks (Hidden Markov Models and n-grams).

  20. Frequency band-importance functions for auditory and auditory-visual speech recognition

    Science.gov (United States)

    Grant, Ken W.

    2005-04-01

    In many everyday listening environments, speech communication involves the integration of both acoustic and visual speech cues. This is especially true in noisy and reverberant environments where the speech signal is highly degraded, or when the listener has a hearing impairment. Understanding the mechanisms involved in auditory-visual integration is a primary interest of this work. Of particular interest is whether listeners are able to allocate their attention to various frequency regions of the speech signal differently under auditory-visual conditions and auditory-alone conditions. For auditory speech recognition, the most important frequency regions tend to be around 1500-3000 Hz, corresponding roughly to important acoustic cues for place of articulation. The purpose of this study is to determine the most important frequency region under auditory-visual speech conditions. Frequency band-importance functions for auditory and auditory-visual conditions were obtained by having subjects identify speech tokens under conditions where the speech-to-noise ratio of different parts of the speech spectrum is independently and randomly varied on every trial. Point biserial correlations were computed for each separate spectral region and the normalized correlations are interpreted as weights indicating the importance of each region. Relations among frequency-importance functions for auditory and auditory-visual conditions will be discussed.
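
    The weighting procedure described here, a point-biserial correlation between per-trial correctness and per-band signal level, can be sketched as follows. The data layout (one SNR value per band per trial and one binary response per trial) is an assumption about the design, and clipping negative correlations to zero is one plausible normalization choice.

        # Minimal sketch of frequency band-importance weights.
        import numpy as np
        from scipy.stats import pointbiserialr

        def band_importance(band_snrs, correct):
            """band_snrs: (n_trials, n_bands) per-band SNR on each trial.
            correct: (n_trials,) 1 if the token was identified, else 0.
            Returns one weight per band, normalized to sum to 1."""
            corrs = np.array([pointbiserialr(correct, band_snrs[:, b])[0]
                              for b in range(band_snrs.shape[1])])
            corrs = np.clip(corrs, 0.0, None)  # negative correlations carry no weight
            return corrs / corrs.sum()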

  1. Emotional recognition from the speech signal for a virtual education agent

    Science.gov (United States)

    Tickle, A.; Raghu, S.; Elshaw, M.

    2013-06-01

    This paper explores the extraction of features from the speech wave to perform intelligent emotion recognition. A feature extraction tool (openSMILE) was used to obtain a baseline set of 998 acoustic features from a set of emotional speech recordings made with a microphone. The initial features were reduced to the most important ones so that recognition of emotions could be performed using a supervised neural network. Given that the future use of virtual education agents lies in making the agents more interactive, developing agents with the capability to recognise and adapt to the emotional state of humans is an important step.

  2. Dynamic HMM Model with Estimated Dynamic Property in Continuous Mandarin Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    CHEN Feili; ZHU Jie

    2003-01-01

    A new dynamic HMM (hidden Markov model) is introduced in this paper, which describes the relationship between dynamic properties and the feature space. The method for estimating the dynamic property is discussed, which makes the dynamic HMM much more practical in real-time speech recognition. Experiments on a large vocabulary continuous Mandarin speech recognition task have shown that the dynamic HMM can achieve about 10% error reduction for both tonal and toneless syllables. The estimated dynamic property achieves nearly the same (or even better) performance as the extracted dynamic property.

  3. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    OpenAIRE

    Heungkyu Lee

    2013-01-01

    This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two‐channel based independent vector analysis (IVA) on a home‐robot cleaner. The issues to be considered in our target application are speech recognition at a distance and noise removal to cope with a variety of noises, including TV sounds, air conditioners, babble, and so on, that can occur in a house, where people can utter a voice command to control a robot cleaner at...

  4. Researches of the Electrotechnical Laboratory. No. 955: Speech recognition by description of acoustic characteristic variations

    Science.gov (United States)

    Hayamizu, Satoru

    1993-09-01

    A new speech recognition technique is proposed. This technique systematically describes acoustic characteristic variations using a large-scale speech database, thereby obtaining high recognition accuracy. Rules representing knowledge about acoustic characteristic variations are extracted by observing the actual speech database. A general framework based on maps from sets of variation factors to the acoustic feature spaces is proposed. A single recognition model is not used for each element of the descriptive units regardless of the states of the variation factors; instead, large-scale and systematically different recognition models are used for different states. A technique to structure the representation of acoustic characteristic variations by clustering recognition models depending on variation factors is proposed. To investigate acoustic characteristic variations for phonetic contexts efficiently, word sets for the reading texts of the speech database are selected so that the maximum number of three-phoneme sequences is covered in as few words as possible. A selection algorithm is proposed in which the first criterion is to maximize the number of different three-phoneme sequences in the word set and the second criterion is to maximize the entropy of the three-phoneme distribution. Read speech data for the word sets were collected and labelled as acoustic-phonetic segments. Speaker-independent word recognition experiments using this speech database were conducted to show the descriptive effectiveness of the acoustic characteristic variations using networks of acoustic-phonetic segments. The experiments show that recognition errors are reduced. A basic framework for estimating the acoustic characteristics of unknown phonetic contexts using decision trees is also proposed.
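
    The word-set selection step lends itself to a greedy sketch: repeatedly pick the word that adds the most unseen three-phoneme sequences. The following is a minimal illustration under stated assumptions (a phonetized lexicon as input; the secondary entropy criterion is omitted), not the exact algorithm from the report.

        # Minimal greedy selection of words maximizing triphone coverage.
        def triphones(phonemes):
            return {tuple(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)}

        def select_words(lexicon, target_size):
            """lexicon: dict mapping word -> phoneme list."""
            covered, chosen = set(), []
            remaining = dict(lexicon)
            while remaining and len(chosen) < target_size:
                word = max(remaining, key=lambda w: len(triphones(remaining[w]) - covered))
                gain = triphones(remaining[word]) - covered
                if not gain:
                    break              # no word adds new coverage; stop early
                covered |= gain
                chosen.append(word)
                del remaining[word]
            return chosen, covered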

  5. Automatic analysis of slips of the tongue: Insights into the cognitive architecture of speech production.

    Science.gov (United States)

    Goldrick, Matthew; Keshet, Joseph; Gustafson, Erin; Heller, Jordana; Needle, Jeremy

    2016-04-01

    Traces of the cognitive mechanisms underlying speaking can be found within subtle variations in how we pronounce sounds. While speech errors have traditionally been seen as categorical substitutions of one sound for another, acoustic/articulatory analyses show they partially reflect the intended sound. When "pig" is mispronounced as "big," the resulting /b/ sound differs from correct productions of "big," moving towards the intended "pig", revealing the role of graded sound representations in speech production. Investigating the origins of such phenomena requires detailed estimation of speech sound distributions; this has been hampered by reliance on subjective, labor-intensive manual annotation. Computational methods can address these issues by providing objective, automatic measurements. We develop a novel high-precision computational approach, based on a set of machine learning algorithms, for measurement of elicited speech. The algorithms are trained on existing manually labeled data to detect and locate linguistically relevant acoustic properties with high accuracy. Our approach is robust, is designed to handle mis-productions, and overall matches the performance of expert coders. It allows us to analyze a very large dataset of speech errors (containing far more errors than the total in the existing literature), illuminating properties of speech sound distributions previously impossible to reliably observe. We argue that this provides novel evidence that two sources both contribute to deviations in speech errors: planning processes specifying the targets of articulation, and articulatory processes specifying the motor movements that execute this plan. These findings illustrate how a much richer picture of speech provides an opportunity to gain novel insights into language processing. PMID:26779665

  6. Speech recognition for the anaesthesia record during crisis scenarios

    DEFF Research Database (Denmark)

    Alapetite, Alexandre

    2008-01-01

    Introduction: This article describes the evaluation of a prototype speech-input interface to an anaesthesia patient record, conducted in a full-scale anaesthesia simulator involving six doctor-nurse anaesthetist teams. Objective: The aims of the experiment were, first, to assess the potential...... by a keyword; combination of command and free text modes); finally, to quantify some of the gains that could be provided by the speech input modality. Methods: Six anaesthesia teams composed of one doctor and one nurse were each confronted with two crisis scenarios in a full-scale anaesthesia...

  7. Predictors of aided speech recognition, with and without frequency compression, in older adults.

    OpenAIRE

    Ellis, Rachel J.; Munro, Kevin J.

    2015-01-01

    OBJECTIVE: The aim was to investigate whether cognitive and/or audiological measures predict aided speech recognition, both with and without frequency compression (FC). DESIGN: Participants wore hearing aids, with and without FC for a total of 12 weeks (six weeks in each signal processing condition, ABA design). Performance on a sentence-in-noise recognition test was assessed at the end of each six-week period. Audiological (severity of high frequency hearing loss, presence of dead regions) a...

  8. Implementation of Speech Recognition in Web Application for Sub Continental Language

    OpenAIRE

    Dilip Kumar; Abhishek Sachan; Malay Kumar

    2014-01-01

    Speech recognition is the ability of a mechanism in a web application to identify voice instructions that match patterns stored in a glossary. Two main ideas are presented in this paper: converting Hindi speech into text, and searching that text in a web application such as Google. Currently, Hidden Markov Models (HMMs) are used for Hindi voice recognition, and the HMM toolkit is used to recognize the Hindi language. It recognizes isolated words using an acoustic word model. The system is t...

  9. Implementation of a Tour Guide Robot System Using RFID Technology and Viterbi Algorithm-Based HMM for Speech Recognition

    Directory of Open Access Journals (Sweden)

    Neng-Sheng Pai

    2014-01-01

    Full Text Available This paper applied speech recognition and RFID technologies to develop an omni-directional mobile robot into a robot with voice control and tour-guide functions. For speech recognition, the speech signals were captured by short-time processing. The speaker first recorded isolated words for the robot to create a speech database of specific speakers. After pre-processing of this speech database, the feature parameters of cepstrum and delta-cepstrum were obtained using linear predictive coefficients (LPC). The Hidden Markov Model (HMM) was then used to train models on the speech database, and the Viterbi algorithm was used to find an optimal state sequence as the reference sample for speech recognition. The trained reference models were loaded into the industrial computer on the robot platform, and the user uttered the isolated words to be tested. After processing by the same front-end and comparison with the reference models, the model whose path had the maximum total probability, found using the Viterbi algorithm, was taken as the recognition result. Finally, the speech recognition and RFID systems were deployed in an actual environment to prove their feasibility and stability, and were implemented on the omni-directional mobile robot.
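
    The scoring step described above, picking the word model whose best state path has the maximum total probability, is the classic Viterbi recursion. A minimal log-domain sketch follows; the model parameters and per-frame emission log-probabilities are assumed to be precomputed from the LPC-derived features, and the `models` loop at the end is a hypothetical usage hint, not the paper's code.

        # Minimal log-domain Viterbi scoring of one utterance against one HMM.
        import numpy as np

        def viterbi_score(log_pi, log_A, log_B):
            """log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
            log_B: (T, S) emission log-probs per frame. Returns best-path score."""
            T, S = log_B.shape
            delta = log_pi + log_B[0]
            for t in range(1, T):
                # best predecessor for each state, then add the new emission
                delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
            return float(np.max(delta))

        # Recognition: the word model with the highest Viterbi score wins, e.g.
        # best_word = max(models, key=lambda m: viterbi_score(m.log_pi, m.log_A,
        #                                                     m.emissions(features)))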

  10. Semi-automatic recognition of marine debris on beaches

    Science.gov (United States)

    Ge, Zhenpeng; Shi, Huahong; Mei, Xuefei; Dai, Zhijun; Li, Daoji

    2016-05-01

    An increasing amount of anthropogenic marine debris is pervading the earth's environmental systems, posing an enormous threat to living organisms. Moreover, the large amount of marine debris around the world has been investigated mostly through tedious manual methods. We therefore propose the use of a new technique, light detection and ranging (LIDAR), for the semi-automatic recognition of marine debris on a beach, because it is substantially more efficient than other, more laborious methods. Our results revealed that LIDAR can be used to classify marine debris into plastic, paper, cloth, and metal. Additionally, we reconstructed a 3-dimensional model of different types of debris on a beach, with a high validity of debris revivification, using LIDAR-based individual separation. These findings demonstrate that this new technique enables detailed observations of debris on a large beach that were previously not possible. It is strongly suggested that LIDAR could be implemented as an appropriate monitoring tool for marine debris by researchers and governments worldwide.

  11. Gaussian process classification using automatic relevance determination for SAR target recognition

    Science.gov (United States)

    Zhang, Xiangrong; Gou, Limin; Hou, Biao; Jiao, Licheng

    2010-10-01

    In this paper, a Synthetic Aperture Radar Automatic Target Recognition approach based on Gaussian process (GP) classification is proposed. It adopts kernel principal component analysis to extract sample features and performs target recognition using GP classification with an automatic relevance determination (ARD) function. Compared with k-Nearest Neighbor, the Naïve Bayes classifier, and the Support Vector Machine, GP with ARD has the advantage of automatic model selection and hyper-parameter optimization. Experiments on UCI datasets and the MSTAR database show that our algorithm is self-tuning and also achieves better recognition accuracy.
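
    In scikit-learn terms (an assumed toolchain, not necessarily the authors'), ARD corresponds to an anisotropic RBF kernel with one length-scale per feature dimension; features whose learned length-scales grow large contribute little to the classification. A minimal sketch:

        # Minimal GP classification sketch with an ARD (anisotropic RBF) kernel.
        import numpy as np
        from sklearn.gaussian_process import GaussianProcessClassifier
        from sklearn.gaussian_process.kernels import RBF

        def fit_gp_ard(X_train, y_train):
            n_dims = X_train.shape[1]
            kernel = 1.0 * RBF(length_scale=np.ones(n_dims))  # one scale per dim = ARD
            gpc = GaussianProcessClassifier(kernel=kernel, n_restarts_optimizer=2)
            gpc.fit(X_train, y_train)  # hyper-parameters tuned by marginal likelihood
            return gpc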

  12. Temporal acuity and speech recognition score in noise in patients with multiple sclerosis

    Directory of Open Access Journals (Sweden)

    Mehri Maleki

    2014-04-01

    Full Text Available Background and Aim: Multiple sclerosis (MS) is a central nervous system disease that can be associated with a variety of symptoms, including hearing disorders. The main consequence of hearing loss is poor speech perception, and temporal acuity plays an important role in speech perception. We evaluated speech perception in silence and in the presence of noise, as well as temporal acuity, in patients with multiple sclerosis. Methods: Eighteen adults with multiple sclerosis with a mean age of 37.28 years and 18 age- and sex-matched controls with a mean age of 38.00 years participated in this study. Temporal acuity and speech perception were evaluated by the random gap detection test (GDT) and the word recognition score (WRS) at three different signal-to-noise ratios. Results: Statistical analysis of the test results revealed significant differences between the two groups (p<0.05). Analysis of the gap detection test (at 4 sensation levels) and word recognition score in both groups showed significant differences (p<0.001). Conclusion: According to this survey, the ability of patients with multiple sclerosis to process the temporal features of a stimulus is impaired. This impairment seems to be an important factor in decreased word recognition scores and speech perception.

  13. ANALYSIS OF MULTIMODAL FUSION TECHNIQUES FOR AUDIO-VISUAL SPEECH RECOGNITION

    Directory of Open Access Journals (Sweden)

    D.V. Ivanko

    2016-05-01

    Full Text Available The paper presents an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of audio and visual speech features. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system for audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.

  14. AUTOMATIC RECOGNITION OF BOTH INTER AND INTRA CLASSES OF DIGITAL MODULATED SIGNALS USING ARTIFICIAL NEURAL NETWORK

    Directory of Open Access Journals (Sweden)

    JIDE JULIUS POPOOLA

    2014-04-01

    Full Text Available In radio communication systems, signal modulation format recognition is a significant capability used in radio signal monitoring and identification. Over the past few decades, modulation formats have become increasingly complex, which has led to the problem of how to accurately and promptly recognize a modulation format. In addressing these challenges, the development of automatic modulation recognition systems that can classify a radio signal's modulation format has received worldwide attention. Decision-theoretic methods and pattern recognition solutions are the two typical automatic modulation recognition approaches. While decision-theoretic approaches use probabilistic or likelihood functions, pattern recognition uses feature-based methods. This study applies the pattern recognition approach based on statistical parameters, using an artificial neural network to classify five different digital modulation formats. The paper deals with automatic recognition of both inter- and intra-class digitally modulated signals, in contrast to most existing algorithms in the literature, which deal with either inter-class or intra-class modulation format recognition. The results of this study show that accurate and prompt modulation recognition is possible beyond the lower bound of 5 dB commonly claimed in the literature. The other significant contribution of this paper is the use of the Python programming language, which reduces the computational complexity that characterizes other automatic modulation recognition classifiers developed using the conventional MATLAB neural network toolbox.

  15. Phonotactics Constraints and the Spoken Word Recognition of Chinese Words in Speech

    Science.gov (United States)

    Yip, Michael C.

    2016-01-01

    Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactic information in the spoken word recognition of Chinese words in speech. Because no legal consonant clusters occur within an individual Chinese word, this kind of categorical phonotactic information of Chinese…

  16. The Affordance of Speech Recognition Technology for EFL Learning in an Elementary School Setting

    Science.gov (United States)

    Liaw, Meei-Ling

    2014-01-01

    This study examined the use of speech recognition (SR) technology to support a group of elementary school children's learning of English as a foreign language (EFL). SR technology has been used in various language learning contexts. Its application to EFL teaching and learning is still relatively recent, but a solid understanding of its…

  17. Comparative Evaluation of Three Continuous Speech Recognition Software Packages in the Generation of Medical Reports

    OpenAIRE

    Devine, Eric G.; Gaehde, Stephan A.; Curtis, Arthur C.

    2000-01-01

    Objective: To compare out-of-box performance of three commercially available continuous speech recognition software packages: IBM ViaVoice 98 with General Medicine Vocabulary; Dragon Systems NaturallySpeaking Medical Suite, version 3.0; and L&H Voice Xpress for Medicine, General Medicine Edition, version 1.2.

  18. Audio-Visual Tibetan Speech Recognition Based on a Deep Dynamic Bayesian Network for Natural Human Robot Interaction

    Directory of Open Access Journals (Sweden)

    Yue Zhao

    2012-12-01

    Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and to build an accurate audio-visual speech recognition model without a frame-independence assumption. The experimental results on Tibetan speech data from real-world environments show that the proposed DDBN outperforms state-of-the-art methods in word recognition accuracy.

  19. Integrating Stress Information in Large Vocabulary Continuous Speech Recognition

    OpenAIRE

    Ludusan, Bogdan; Ziegler, Stefan; Gravier, Guillaume

    2012-01-01

    In this paper we propose a novel method for integrating stress information in the decoding step of a speech recognizer. A multiscale rhythm model was used to determine stress scores for each syllable, which are then used to reinforce paths during search. Two strategies for integrating the stress were employed: the first reinforces paths through all syllables with a value proportional to their stress score, while the second enhances paths passing only through stressed sy...

  20. High-level Approaches to Confidence Estimation in Speech Recognition

    OpenAIRE

    S. J. Cox; Dasmahapatra, S.

    2002-01-01

    We describe some high-level approaches to estimating confidence scores for the words output by a speech recognizer. By "high-level" we mean that the proposed measures do not rely on decoder-specific "side information" and so should find more general applicability than measures that have been developed for specific recognizers. Our main approach is to attempt to decouple the language modeling and acoustic modeling in the recognizer in order to generate independent information from these two sour...

  1. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech.

    Science.gov (United States)

    Bremner, Paul; Leonards, Ute

    2016-01-01

    Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realized remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances. PMID:26925010

  2. Iconic Gestures for Robot Avatars, Recognition and Integration with Speech

    Directory of Open Access Journals (Sweden)

    Paul Adam Bremner

    2016-02-01

    Full Text Available Co-verbal gestures are an important part of human communication, improving its efficiency and efficacy for information conveyance. One possible means by which such multi-modal communication might be realised remotely is through the use of a tele-operated humanoid robot avatar. Such avatars have been previously shown to enhance social presence and operator salience. We present a motion tracking based tele-operation system for the NAO robot platform that allows direct transmission of speech and gestures produced by the operator. To assess the capabilities of this system for transmitting multi-modal communication, we have conducted a user study that investigated if robot-produced iconic gestures are comprehensible, and are integrated with speech. Robot performed gesture outcomes were compared directly to those for gestures produced by a human actor, using a within participant experimental design. We show that iconic gestures produced by a tele-operated robot are understood by participants when presented alone, almost as well as when produced by a human. More importantly, we show that gestures are integrated with speech when presented as part of a multi-modal communication equally well for human and robot performances.

  3. Application of neural networks to speech recognition. Onsei ninshiki eno neural net oyo

    Energy Technology Data Exchange (ETDEWEB)

    Nitta, T.; Masai, Y. (Toshiba Corp., Tokyo (Japan))

    1991-12-01

    The application of neural networks to speech recognition is reviewed; it has produced good results, in particular in the fields of phoneme recognition and small-vocabulary spoken word recognition. Because the implementation of a training process is unnecessary in speaker-independent speech recognizers, a neural-based speech recognition board can easily be realized with a digital signal processor (DSP). Since a standard multilayer neural network cannot accommodate large pattern variabilities on the time axis, a hybrid algorithm composed of a neural network and a hidden Markov model (HMM) is regarded as the most promising solution. Speaker-independent large-vocabulary spoken word recognition is given as an example, in which phonemes are recognized by the neural network and the resulting phoneme sequence is matched against that of each word by the HMM. In addition, since a recurrent neural network can represent transitions between states by itself, the same process as the HMM is expected to be realizable with such a network. 16 refs., 3 figs.

  4. One-against-all weighted dynamic time warping for language-independent and speaker-dependent speech recognition in adverse conditions.

    Directory of Open Access Journals (Sweden)

    Xianglilan Zhang

    Full Text Available Considering personal privacy and the difficulty of obtaining training material for many seldom-used English words and (often non-English) names, language-independent (LI) with lightweight speaker-dependent (SD) automatic speech recognition (ASR) is a promising option to solve the problem. The dynamic time warping (DTW) algorithm is the state-of-the-art algorithm for small-footprint SD ASR applications with limited storage space and small vocabularies, such as voice dialing on mobile devices, menu-driven recognition, and voice control in vehicles and robotics. Even though we have successfully developed two fast and accurate DTW variations for clean speech data, speech recognition in adverse conditions is still a big challenge. In order to improve recognition accuracy in noisy environments and bad recording conditions, such as too high or too low volume, we introduce a novel one-against-all weighted DTW (OAWDTW). This method defines a one-against-all index (OAI) for each time frame of the training data and applies the OAIs to the core DTW process. Given two speech signals, OAWDTW tunes their final alignment score by using the OAI in the DTW process. Our method achieves better accuracies than DTW and merge-weighted DTW (MWDTW): a 6.97% relative reduction of error rate (RRER) compared with DTW and a 15.91% RRER compared with MWDTW are observed in our extensive experiments on one representative SD dataset of four speakers' recordings. To the best of our knowledge, the OAWDTW approach is the first weighted DTW specially designed for speech data in adverse conditions.
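
    For readers unfamiliar with the baseline, the core DTW alignment that OAWDTW modifies is shown below. This is plain DTW, a minimal sketch: the one-against-all weighting would scale the local distances by per-frame OAI values, which are assumed to be precomputed and are omitted here.

        # Minimal DTW alignment cost between two feature sequences.
        import numpy as np

        def dtw_distance(a, b):
            """a: (m, d) and b: (n, d) feature sequences; returns alignment cost."""
            m, n = len(a), len(b)
            D = np.full((m + 1, n + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[m, n]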

  5. Speech variability effects on recognition accuracy associated with concurrent task performance by pilots

    Science.gov (United States)

    Simpson, C. A.

    1985-01-01

    In the present study of the responses of pairs of pilots to aircraft warning classification tasks using an isolated word, speaker-dependent speech recognition system, the induced stress was manipulated by means of different scoring procedures for the classification task and by the inclusion of a competitive manual control task. Both speech patterns and recognition accuracy were analyzed, and recognition errors were recorded by type for an isolated word speaker-dependent system and by an offline technique for a connected word speaker-dependent system. While errors increased with task loading for the isolated word system, there was no such effect for task loading in the case of the connected word system.

  6. Automatic evaluation of speech rhythm instability and acceleration in dysarthrias associated with basal ganglia dysfunction

    Directory of Open Access Journals (Sweden)

    Jan Rusz

    2015-07-01

    Full Text Available Speech rhythm abnormalities are commonly present in patients with different neurodegenerative disorders. These alterations are hypothesized to be a consequence of disruption to the basal ganglia circuitry involving dysfunction of motor planning, programming and execution, which can be detected by a syllable repetition paradigm. Therefore, the aim of the present study was to design a robust signal processing technique that allows the automatic detection of spectrally-distinctive nuclei of syllable vocalizations and to determine speech features that represent rhythm instability and acceleration. A further aim was to elucidate specific patterns of dysrhythmia across various neurodegenerative disorders that share disruption of basal ganglia function. Speech samples based on repetition of the syllable /pa/ at a self-determined steady pace were acquired from 109 subjects, including 22 with Parkinson's disease (PD), 11 with progressive supranuclear palsy (PSP), 9 with multiple system atrophy (MSA), 24 with ephedrone-induced parkinsonism (EP), 20 with Huntington's disease (HD), and 23 healthy controls. Subsequently, an algorithm for the automatic detection of syllables, as well as features representing rhythm instability and rhythm acceleration, was designed. The proposed detection algorithm was able to correctly identify syllables and remove erroneous detections due to excessive inspiration and non-speech sounds with a very high accuracy of 99.6%. Instability of vocal pace performance was observed in the PSP, MSA, EP and HD groups. Significantly increased pace acceleration was observed only in the PD group. Although not significant, a tendency for pace acceleration was observed also in the PSP and MSA groups. Our findings underline the crucial role of the basal ganglia in the execution and maintenance of automatic speech motor sequences. We envisage the current approach to become the first step towards the development of acoustic technologies allowing automated assessment of rhythm

  7. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR

    OpenAIRE

    Weninger, Felix; Erdogan, Hakan; Watanabe, Shinji; Vincent, Emmanuel; Le Roux, Jonathan; Hershey, John R.; Schuller, Björn

    2015-01-01

    We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM speech enhancement, even when used 'naïvely' as front-end processing, delivers competitive results on the CHiME-2 speech recognition task. Furtherm...

  8. Simultaneous Blind Separation and Recognition of Speech Mixtures Using Two Microphones to Control a Robot Cleaner

    Directory of Open Access Journals (Sweden)

    Heungkyu Lee

    2013-02-01

    Full Text Available This paper proposes a method for the simultaneous separation and recognition of speech mixtures in noisy environments using two-channel independent vector analysis (IVA) on a home-robot cleaner. The issues to be considered in our target application are speech recognition at a distance and the removal of a variety of noises, including TV sounds, air conditioners, and babble, that can occur in a house, where people may utter a voice command to control the robot cleaner at any time and at any location, even while the cleaner is moving. Thus, the system should always be in a recognition-ready state to promptly recognize a spoken word at any time, and the false acceptance rate should be low. To cope with these issues, a keyword spotting technique is applied. In addition, a microphone alignment method and a model-based real-time IVA approach are proposed to effectively and simultaneously process the speech and noise sources, and to cover 360-degree directions irrespective of distance. The experimental evaluations show that the proposed method is robust in terms of speech recognition accuracy, even when the speaker location is unfixed and changes all the time. In addition, the proposed method performs well in severely noisy environments.

  9. Lip Localization and Viseme Classification for Visual Speech Recognition

    OpenAIRE

    Werda, Salah; Mahdi, Walid; Hamadou, Abdelmajid Ben

    2013-01-01

    The need for an automatic lip-reading system is ever increasing. In fact, extraction and reliable analysis of facial movements today make up an important part of many multimedia systems, such as videoconferencing, low bit-rate communication systems, and lip-reading systems. In addition, visual information is imperative for people with special needs. We can imagine, for example, a dependent person commanding a machine with a simple lip movement or syllable pronunciation. Moreover, people with heari...

  10. A Factored Language Model for Prosody Dependent Speech Recognition

    OpenAIRE

    Chen, Ken; Hasegawa-Johnson, Mark A.; Cole, Jennifer S.

    2007-01-01

    In this chapter, we proposed a novel approach that improves the robustness of prosody dependent language modeling by leveraging the dependence between prosody and syntax. In our experiments on Radio News Corpus, a factorial prosody dependent language model estimated using our proposed approach has achieved as much as 31% reduction of the joint perplexity over a prosody dependent language model estimated using the standard Maximum Likelihood approach. In recognition experiments, our approach r...

  11. Real-time contrast enhancement to improve speech recognition.

    Science.gov (United States)

    Alexander, Joshua M; Jenison, Rick L; Kluender, Keith R

    2011-01-01

    An algorithm that operates in real-time to enhance the salient features of speech is described and its efficacy is evaluated. The Contrast Enhancement (CE) algorithm implements dynamic compressive gain and lateral inhibitory sidebands across channels in a modified winner-take-all circuit, which together produce a form of suppression that sharpens the dynamic spectrum. Normal-hearing listeners identified spectrally smeared consonants (VCVs) and vowels (hVds) in quiet and in noise. Consonant and vowel identification, especially in noise, were improved by the processing. The amount of improvement did not depend on the degree of spectral smearing or talker characteristics. For consonants, when results were analyzed according to phonetic feature, the most consistent improvement was for place of articulation. This is encouraging for hearing aid applications because confusions between consonants differing in place are a persistent problem for listeners with sensorineural hearing loss. PMID:21949736

  12. Low-Complexity Variable Frame Rate Analysis for Speech Recognition and Voice Activity Detection

    DEFF Research Database (Denmark)

    Tan, Zheng-Hua; Lindberg, Børge

    2010-01-01

    present a low-complexity and effective frame selection approach based on a posteriori signal-to-noise ratio (SNR) weighted energy distance: The use of an energy distance, instead of e.g. a standard cepstral distance, makes the approach computationally efficient and enables fine granularity search......, and the use of a posteriori SNR weighting emphasizes the reliable regions in noisy speech signals. It is experimentally found that the approach is able to assign a higher frame rate to fast changing events such as consonants, a lower frame rate to steady regions like vowels and no frames to silence, even...... for very low SNR signals. The resulting variable frame rate analysis method is applied to three speech processing tasks that are essential to natural interaction with intelligent environments. First, it is used for improving speech recognition performance in noisy environments. Secondly, the method is used...
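
    One way to read the frame-selection idea in this record is sketched below: an energy distance between consecutive frames is weighted by the a posteriori SNR, and a frame is kept whenever the accumulated weighted distance crosses a threshold. The noise-energy estimate, the accumulate-and-reset rule, and the threshold value are assumptions of this sketch rather than the authors' exact formulation.

        # Minimal sketch of a posteriori SNR-weighted variable frame rate selection.
        import numpy as np

        def select_frames(frame_log_energies, noise_log_energy, threshold=1.0):
            """frame_log_energies: (T,) per-frame log energies; returns kept indices."""
            post_snr = np.maximum(frame_log_energies - noise_log_energy, 0.0)
            kept, acc = [], 0.0
            for t in range(1, len(frame_log_energies)):
                dist = abs(frame_log_energies[t] - frame_log_energies[t - 1])
                acc += post_snr[t] * dist         # SNR-weighted energy distance
                if acc >= threshold:              # fast-changing region: keep frame
                    kept.append(t)
                    acc = 0.0
            return kept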

  13. The role of speech in the user interface : perspective and application

    OpenAIRE

    Abewusi, A.B.

    1994-01-01

    Consideration must be given to the implications of speech as a communication medium before deciding to use speech input or output in an interactive environment. There are several effective control strategies for improving the quality of speech. The utility of speech has been demonstrated by application to several illustrative problems, where it has proved effective despite all the limitations of synthetic speech output and automatic speech recognition systems. (Author's abstract)

  14. Performance Evaluation of Speech Recognition Systems as a Next-Generation Pilot-Vehicle Interface Technology

    Science.gov (United States)

    Arthur, Jarvis J., III; Shelton, Kevin J.; Prinzel, Lawrence J., III; Bailey, Randall E.

    2016-01-01

    During the flight trials known as the Gulfstream-V Synthetic Vision Systems Integrated Technology Evaluation (GV-SITE), a Speech Recognition System (SRS) was used by the evaluation pilots. The SRS was intended to be an intuitive interface for display control (rather than knobs, buttons, etc.). This paper describes the performance of the current state-of-the-art Speech Recognition System. The commercially available technology was evaluated as an application for possible inclusion in commercial aircraft flight decks as a crew-to-vehicle interface, specifically as an interface from the aircrew to the onboard displays, controls, and flight management tasks. Both a flight test and a laboratory test of the SRS were conducted.

  15. A Log—Index Weighted Cepstral Distance Measure for Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    郑方; 吴文虎; et al.

    1997-01-01

    A log-index weighted cepstral distance measure is proposed and tested in speaker-independent and speaker-dependent isolated word recognition systems using statistical techniques. The weights for the cepstral coefficients of this measure equal the logarithm of the corresponding indices. The experimental results show that this kind of measure works better than other weighted Euclidean cepstral distance measures on three speech databases. The error rate obtained using this measure is about 1.8 percent on average for the three databases, which is a 25% reduction from that obtained using other measures and a 40% reduction from that obtained using the Log Likelihood Ratio (LLR) measure. The experimental results also show that this distance measure works well in both speaker-dependent and speaker-independent speech recognition systems.
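
    The weighting itself is fully specified by the abstract (the i-th cepstral coefficient gets weight log i), while the distance form below, a weighted Euclidean distance, is one plausible reading rather than the paper's exact definition. Note that log 1 = 0, so the first coefficient drops out entirely.

        # Minimal sketch of a log-index weighted cepstral distance.
        import numpy as np

        def log_index_cepstral_distance(c1, c2):
            """c1, c2: cepstral coefficient vectors of equal length (index from 1)."""
            c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
            w = np.log(np.arange(1, len(c1) + 1))       # w_i = log(i)
            return np.sqrt(np.sum((w * (c1 - c2)) ** 2))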

  16. Robust Features for Speech Recognition using Temporal Filtering Technique in the Presence of Impulsive Noise

    Directory of Open Access Journals (Sweden)

    Hajer Rahali

    2014-10-01

    Full Text Available In this paper we introduce a robust feature extractor, dubbed Modified Function Cepstral Coefficients (MODFCC), based on a gammachirp filterbank, Relative Spectral (RASTA) filtering, and Autoregressive Moving-Average (ARMA) filtering. The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. In speech recognition systems, Mel-Frequency Cepstral Coefficients (MFCC) and their RASTA- and ARMA-filtered variants (RASTA-MFCC and ARMA-MFCC) are the three main techniques used. This paper presents some modifications to the original MFCC method. In our work the effectiveness of the proposed changes to MFCC was tested and compared against the original RASTA-MFCC and ARMA-MFCC features. Prosodic features such as jitter and shimmer are added to the baseline spectral features. The above-mentioned techniques were tested with impulsive signals under various noisy conditions within the AURORA databases.

  17. Tone model integration based on discriminative weight training for Putonghua speech recognition

    Institute of Scientific and Technical Information of China (English)

    HUANG Hao; ZHU Jie

    2008-01-01

    A discriminative framework for tone model integration in continuous speech recognition is proposed. The method uses model-dependent weights to scale the probabilities of hidden Markov models based on spectral features and tone models based on tonal features. The weights are discriminatively trained by the minimum phone error criterion. An update equation for the model weights based on the extended Baum-Welch algorithm is derived. Various schemes of model weight combination are evaluated, and a smoothing technique is introduced to make training robust to overfitting. The proposed method is evaluated on tonal syllable output and character output speech recognition tasks. The experimental results show that the proposed method obtains 9.5% and 4.7% relative error reductions over global weights on the two tasks, owing to a better interpolation of the given models. This proves the effectiveness of discriminatively trained model weights for tone model integration.

  18. Audio-Visual Speech Recognition Using Lip Information Extracted from Side-Face Images

    Directory of Open Access Journals (Sweden)

    Koji Iwano

    2007-03-01

    Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images, as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected-digit speech contaminated with white noise at various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.

  19. Combined Acoustic and Pronunciation Modelling for Non-Native Speech Recognition

    CERN Document Server

    Bouselmi, Ghazi; Illina, Irina

    2007-01-01

    In this paper, we present several adaptation methods for non-native speech recognition. We have tested pronunciation modelling, MLLR and MAP non-native pronunciation adaptation, and HMM model retraining on the HIWIRE foreign-accented English speech database. The "phonetic confusion" scheme we have developed consists of associating with each spoken phone several sequences of confused phones. In our experiments, we have used different combinations of acoustic models representing the canonical and the foreign pronunciations: spoken and native models, and models adapted to the non-native accent with MAP and MLLR. The joint use of pronunciation modelling and acoustic adaptation led to further improvements in recognition accuracy. The best combination of the above-mentioned techniques resulted in a relative word error reduction ranging from 46% to 71%.

  20. Coordinated control of an intelligent wheelchair based on a brain-computer interface and speech recognition

    Institute of Scientific and Technical Information of China (English)

    Hong-tao WANG; Yuan-qing LI; Tian-you YU

    2014-01-01

    An intelligent wheelchair is devised, controlled by a coordinated mechanism based on a brain-computer interface (BCI) and speech recognition. By performing appropriate activities, users can navigate the wheelchair with four steering behaviors (start, stop, turn left, and turn right). Five healthy subjects participated in an indoor experiment. The results demonstrate the efficiency of the coordinated control mechanism, with satisfactory path and time optimality ratios, and show that speech recognition is a fast and accurate supplement for BCI-based control systems. The proposed intelligent wheelchair is especially suitable for patients suffering from paralysis (especially those with aphasia) who can learn to pronounce only a single sound (e.g., 'ah').

  1. Feeling backwards? How temporal order in speech affects the time course of vocal emotion recognition

    Directory of Open Access Journals (Sweden)

    Simon Rigoulot

    2013-06-01

    Full Text Available Recent studies suggest that the time course for recognizing vocal expressions of basic emotion in speech varies significantly by emotion type, implying that listeners uncover acoustic evidence about emotions at different rates in speech (e.g., fear is recognized most quickly whereas happiness and disgust are recognized relatively slowly; Pell and Kotz, 2011). To investigate whether vocal emotion recognition is largely dictated by the amount of time listeners are exposed to speech or the position of critical emotional cues in the utterance, 40 English participants judged the meaning of emotionally-inflected pseudo-utterances presented in a gating paradigm, where utterances were gated as a function of their syllable structure in segments of increasing duration from the end of the utterance (i.e., gated 'backwards'). Accuracy for detecting six target emotions in each gate condition and the mean identification point for each emotion in milliseconds were analyzed and compared to results from Pell and Kotz (2011). We again found significant emotion-specific differences in the time needed to accurately recognize emotions from speech prosody, and new evidence that utterance-final syllables tended to facilitate listeners' accuracy in many conditions when compared to utterance-initial syllables. The time needed to recognize fear, anger, sadness, and neutral from speech cues was not influenced by how utterances were gated, although happiness and disgust were recognized significantly faster when listeners heard the end of utterances first. Our data provide new clues about the relative time course for recognizing vocally-expressed emotions within the 400-1200 millisecond time window, while highlighting that emotion recognition from prosody can be shaped by the temporal properties of speech.

  2. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Science.gov (United States)

    Vlaj, Damjan; Kotnik, Bojan; Horvat, Bogomir; Kačič, Zdravko

    2005-12-01

    This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called Hangover criterion. Comparative tests are presented between the presented MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) Standards. These tests were made on the Aurora 2 database, with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards (G.723.1, G.729, and DSR VAD) in all SNRs; the exact relative improvements are given in the full text.
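
    For intuition, here is a toy mel-filter-bank VAD with a hangover mechanism in the spirit of the one described above; the threshold and hangover length are invented for illustration and do not come from the paper.

```python
import numpy as np

def mfb_vad(mfb_frames, threshold=2.0, hangover=8):
    """Toy MFB VAD: a frame is speech when its mean log-MFB energy
    exceeds a threshold; a hangover counter keeps the decision
    'speech' for a few frames after the last hit so that weak word
    endings are not clipped. Constants are illustrative."""
    decisions, hang = [], 0
    for frame in mfb_frames:
        if np.mean(frame) > threshold:
            hang = hangover          # re-arm the hangover timer
            decisions.append(True)
        elif hang > 0:
            hang -= 1                # still inside the hangover period
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```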

  3. A Computationally Efficient Mel-Filter Bank VAD Algorithm for Distributed Speech Recognition Systems

    Directory of Open Access Journals (Sweden)

    Vlaj Damjan

    2005-01-01

    Full Text Available This paper presents a novel computationally efficient voice activity detection (VAD) algorithm and emphasizes the importance of such algorithms in distributed speech recognition (DSR) systems. When using VAD algorithms in telecommunication systems, the required capacity of the speech transmission channel can be reduced if only the speech parts of the signal are transmitted. A similar objective can be adopted in DSR systems, where the nonspeech parameters are not sent over the transmission channel. A novel approach is proposed for VAD decisions based on mel-filter bank (MFB) outputs with the so-called Hangover criterion. Comparative tests are presented between the presented MFB VAD algorithm and three VAD algorithms used in the G.729, G.723.1, and DSR (advanced front-end) Standards. These tests were made on the Aurora 2 database, with different signal-to-noise ratios (SNRs). In the speech recognition tests, the proposed MFB VAD outperformed all three VAD algorithms used in the standards (G.723.1, G.729, and DSR VAD) in all SNRs; the relative improvement figures are given in the full text.

  4. Searching for sources of variance in speech recognition: Young adults with normal hearing

    Science.gov (United States)

    Watson, Charles S.; Kidd, Gary R.

    2005-04-01

    In the present investigation, sensory-perceptual abilities of one thousand young adults with normal hearing are being evaluated with a range of auditory, visual, and cognitive measures. Four auditory measures were derived from factor-analytic analyses of previous studies with 18-20 speech and non-speech variables [G. R. Kidd et al., J. Acoust. Soc. Am. 108, 2641 (2000)]. Two measures of visual acuity are obtained to determine whether variation in sensory skills tends to exist primarily within or across sensory modalities. A working memory test, grade point average, and Scholastic Aptitude Test scores (Verbal and Quantitative) are also included. Preliminary multivariate analyses support previous studies of individual differences in auditory abilities [e.g., A. M. Surprenant and C. S. Watson, J. Acoust. Soc. Am. 110, 2085-2095 (2001)] which found that spectral and temporal resolving power obtained with pure tones and more complex unfamiliar stimuli have little or no correlation with measures of speech recognition under difficult listening conditions. The current findings show that visual acuity, working memory, and intellectual measures are also very poor predictors of speech recognition ability, supporting the independence of this processing skill. Remarkable performance by some exceptional listeners will be described. [Work supported by the Office of Naval Research, Award No. N000140310644.]

  5. Managing predefined templates and macros for a departmental speech recognition system using common software

    OpenAIRE

    Sistrom, Chris L.; Honeyman, Janice C.; Mancuso, Anthony; Quisling, Ronald G.

    2001-01-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and the two systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design consider...

  6. Authenticity affects the recognition of emotions in speech: behavioral and fMRI evidence

    OpenAIRE

    Drolet, Matthis; Schubotz, Ricarda I.; Fischer, Julia

    2011-01-01

    The aim of the present study was to determine how authenticity of emotion expression in speech modulates activity in the neuronal substrates involved in emotion recognition. Within an fMRI paradigm, participants judged either the authenticity (authentic or play acted) or emotional content (anger, fear, joy, or sadness) of recordings of spontaneous emotions and reenactments by professional actors. When contrasting between task types, active judgment of authenticity, more than active judgment o...

  7. Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition

    OpenAIRE

    SAK, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise

    2015-01-01

    We have recently shown that deep Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) outperform feed forward deep neural networks (DNNs) as acoustic models for speech recognition. More recently, we have shown that the performance of sequence trained context dependent (CD) hidden Markov model (HMM) acoustic models using such LSTM RNNs can be equaled by sequence trained phone models initialized with connectionist temporal classification (CTC). In this paper, we present techniques tha...

  8. Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition

    OpenAIRE

    Li, Xiangang; Wu, Xihong

    2014-01-01

    Long short-term memory (LSTM) based acoustic modeling methods have recently been shown to give state-of-the-art performance on some speech recognition tasks. To achieve a further performance improvement, in this research, deep extensions on LSTM are investigated considering that deep hierarchical model has turned out to be more efficient than a shallow one. Motivated by previous research on constructing deep recurrent neural networks (RNNs), alternative deep LSTM architectures are proposed an...

  9. A SPEECH RECOGNITION METHOD USING COMPETITIVE AND SELECTIVE LEARNING NEURAL NETWORKS

    Institute of Scientific and Technical Information of China (English)

    2000-01-01

    On the basis of Gersho's asymptotic theory, the isodistortion principle of vector clustering is discussed, and a competitive and selective learning (CSL) method is proposed that avoids local optima and gives excellent results when applied to the clustering of HMM models. By combining parallel, self-organizing hierarchical neural networks (PSHNN) to reclassify the scores output by the HMM, the CSL speech recognition rate is markedly improved.

  10. Audiovisual benefit for recognition of speech presented with single-talker noise in older listeners

    OpenAIRE

    Jesse, A.; Janse, E.

    2012-01-01

    Older listeners are more affected than younger listeners in their recognition of speech in adverse conditions, such as when they also hear a single-competing speaker. In the present study, we investigated with a speeded response task whether older listeners with various degrees of hearing loss benefit under such conditions from also seeing the speaker they intend to listen to. We also tested, at the same time, whether older adults need postperceptual processing to obtain an audiovisual benefi...

  11. Visual Word Recognition is Accompanied by Covert Articulation: Evidence for a Speech-like Phonological Representation

    OpenAIRE

    Eiter, Brianna M.; INHOFF, ALBRECHT W.

    2008-01-01

    Two lexical decision task (LDT) experiments examined whether visual word recognition involves the use of a speech-like phonological code that may be generated via covert articulation. In Experiment 1, each visual item was presented with an irrelevant spoken word (ISW) that was either phonologically identical, similar, or dissimilar to it. An ISW delayed classification of a visual word when the two were phonologically similar, and it delayed the classification of a pseudoword when it was ident...

  12. Hidden Markov Model for Speech Recognition Using Modified Forward-Backward Re-estimation Algorithm

    OpenAIRE

    Balwant A. Sonkamble

    2012-01-01

    There are various practical implementation issues for HMMs, the main one being the use of scaling factors, which are used to obtain smoothed, numerically stable probabilities. The proposed technique, called the Modified Forward-Backward Re-estimation algorithm, is used to recognize speech patterns. The proposed algorithm has shown very good recognition accuracy compared to the conventional Forward-Backward Re-estimation algorithm.
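
    The scaling issue the abstract refers to can be made concrete: without scaling, the forward probabilities underflow on long utterances. A standard scaled forward pass (a textbook formulation, not the paper's modified algorithm) looks like this:

```python
import numpy as np

def scaled_forward(A, B, pi):
    """Forward algorithm with per-frame scaling: each alpha vector is
    normalized to sum to 1, and the log-likelihood is recovered from
    the scaling factors, avoiding floating-point underflow on long
    observation sequences. B[t, j] = p(o_t | state j)."""
    T, N = B.shape
    alpha = np.empty((T, N))
    scale = np.empty(T)
    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    return alpha, np.log(scale).sum()   # scaled alphas, log P(O | model)
```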

  13. Implementing a Hidden Markov Model Speech Recognition System in Programmable Logic

    OpenAIRE

    Melnikoff, Stephen Jonathan; Quigley, Steven Francis; Russell, Martin

    2001-01-01

    Performing Viterbi decoding for continuous real-time speech recognition is a highly computationally-demanding task, but is one which can take good advantage of a parallel processing architecture. To this end, we describe a system which uses an FPGA for the decoding and a PC for pre- and post-processing, taking advantage of the properties of this kind of programmable logic device, specifically its ability to perform in parallel the large number of additions and comparisons required. We compare...
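
    For reference, the computation being parallelized here is the per-frame max/argmax over predecessor states. A plain software log-domain Viterbi sketch, with illustrative matrix shapes:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Viterbi decoding in the log domain. log_A: (N, N) transition
    log-probs, log_B: (T, N) observation log-likelihoods, log_pi: (N,)
    initial log-probs. The per-frame max/argmax over predecessors is
    what an FPGA can evaluate for all states in parallel."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        trans = delta[:, None] + log_A        # score of every (i -> j) move
        psi[t] = trans.argmax(axis=0)
        delta = trans.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):             # backtrace
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```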

  14. Time-Varying Noise Estimation for Speech Enhancement and Recognition Using Sequential Monte Carlo Method

    Directory of Open Access Journals (Sweden)

    Kaisheng Yao

    2004-11-01

    Full Text Available We present a method for sequentially estimating time-varying noise parameters. Noise parameters are sequences of time-varying mean vectors representing the noise power in the log-spectral domain. The proposed sequential Monte Carlo method generates a set of particles in compliance with the prior distribution given by clean speech models. The noise parameters in this model evolve according to random walk functions and the model uses extended Kalman filters to update the weight of each particle as a function of observed noisy speech signals, speech model parameters, and the evolved noise parameters in each particle. Finally, the updated noise parameter is obtained by means of minimum mean square error (MMSE estimation on these particles. For efficient computations, the residual resampling and Metropolis-Hastings smoothing are used. The proposed sequential estimation method is applied to noisy speech recognition and speech enhancement under strongly time-varying noise conditions. In both scenarios, this method outperforms some alternative methods.
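
    A heavily simplified sketch of this style of sequential Monte Carlo noise tracking: particles are candidate log-spectral noise means evolving by a random walk, weights come from a Gaussian observation likelihood (standing in for the paper's extended-Kalman update), and the MMSE estimate is the weighted particle mean. All constants are illustrative.

```python
import numpy as np

def track_noise(noisy_logspec, n_particles=100, walk_std=0.05, obs_std=0.5):
    """Toy particle-filter noise tracker over (T, D) log-spectral frames."""
    rng = np.random.default_rng(0)
    T, D = noisy_logspec.shape
    particles = np.tile(noisy_logspec[0], (n_particles, 1))
    estimates = []
    for t in range(T):
        particles += rng.normal(0.0, walk_std, particles.shape)  # random walk
        err = noisy_logspec[t] - particles
        logw = -0.5 * (err ** 2).sum(axis=1) / obs_std ** 2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        estimates.append(w @ particles)                 # MMSE estimate
        idx = rng.choice(n_particles, n_particles, p=w) # multinomial resampling
        particles = particles[idx]                      # (paper: residual)
    return np.array(estimates)
```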

  15. Image analysis in automatic system of pollen recognition

    OpenAIRE

    Piotr Rapiejko; Zbigniew M. Wawrzyniak; Ryszard S. Jachowicz; Dariusz Jurkiewicz

    2012-01-01

    In allergology practice and research, it would be convenient to receive pollen identification and monitoring results in a much shorter time than human identification provides. Image-based analysis is one approach to an automated identification scheme for pollen grains, and pattern recognition on such images is widely used as a powerful tool. The goal of such an attempt is to provide accurate, fast recognition, classification and counting of pollen grains by a computer system for monit...

  16. Automatic model-based face reconstruction and recognition

    OpenAIRE

    Breuer, Pia

    2011-01-01

    Three-dimensional Morphable Models (3DMM) are known to be valuable tools for both face reconstruction and face recognition. These models are particularly relevant in safety applications or Computer Graphics. In this thesis, contributions are made to address the major difficulties preceding and during the fitting process of the Morphable Model in the framework of a fully automated system. It is shown to what extent the reconstruction and recognition results depend on the initialization and wha...

  17. Automatic Facial Expression Recognition Using Features of Salient Facial Patches

    OpenAIRE

    Happy, S L; Routray, Aurobinda

    2015-01-01

    Extraction of discriminative features from salient facial patches plays a vital role in effective facial expression recognition. The accurate detection of facial landmarks improves the localization of the salient patches on face images. This paper proposes a novel framework for expression recognition by using appearance features of selected facial patches. A few prominent facial patches, depending on the position of facial landmarks, are extracted which are active during emotion elicitation. ...

  18. Superior Speech Acquisition and Robust Automatic Speech Recognition for Integrated Spacesuit Audio Systems Project

    Data.gov (United States)

    National Aeronautics and Space Administration — Astronauts suffer from poor dexterity of their hands due to the clumsy spacesuit gloves during Extravehicular Activity (EVA) operations and NASA has had a widely...

  19. Sparse representation in speech signal processing

    Science.gov (United States)

    Lee, Te-Won; Jang, Gil-Jin; Kwon, Oh-Wook

    2003-11-01

    We review the sparse representation principle for processing speech signals. A transformation for encoding the speech signals is learned such that the resulting coefficients are as independent as possible. We use independent component analysis with an exponential prior to learn a statistical representation for speech signals. This representation leads to extremely sparse priors that can be used for encoding speech signals for a variety of purposes. We review applications of this method for speech feature extraction, automatic speech recognition and speaker identification. Furthermore, this method is also suited for tackling the difficult problem of separating two sounds given only a single microphone.

  20. Suprasegmental lexical stress cues in visual speech can guide spoken-word recognition.

    Science.gov (United States)

    Jesse, Alexandra; McQueen, James M

    2014-01-01

    Visual cues to the individual segments of speech and to sentence prosody guide speech recognition. The present study tested whether visual suprasegmental cues to the stress patterns of words can also constrain recognition. Dutch listeners use acoustic suprasegmental cues to lexical stress (changes in duration, amplitude, and pitch) in spoken-word recognition. We asked here whether they can also use visual suprasegmental cues. In two categorization experiments, Dutch participants saw a speaker say fragments of word pairs that were segmentally identical but differed in their stress realization (e.g., 'ca-vi from cavia "guinea pig" vs. 'ka-vi from kaviaar "caviar"). Participants were able to distinguish between these pairs from seeing a speaker alone. Only the presence of primary stress in the fragment, not its absence, was informative. Participants were able to distinguish visually primary from secondary stress on first syllables, but only when the fragment-bearing target word carried phrase-level emphasis. Furthermore, participants distinguished fragments with primary stress on their second syllable from those with secondary stress on their first syllable (e.g., pro-'jec from projector "projector" vs. 'pro-jec from projectiel "projectile"), independently of phrase-level emphasis. Seeing a speaker thus contributes to spoken-word recognition by providing suprasegmental information about the presence of primary lexical stress. PMID:24134065

  1. How does language model size affect speech recognition accuracy for the Turkish language?

    Directory of Open Access Journals (Sweden)

    Behnam ASEFİSARAY

    2016-05-01

    Full Text Available In this paper we aimed at investigating the effect of Language Model (LM) size on Speech Recognition (SR) accuracy. We also provided details of our approach for obtaining the LM for Turkish. Since the LM is obtained by statistical processing of raw text, we expect that increasing the size of the data available for training the LM will improve SR accuracy. Since this study is based on recognition of Turkish, which is a highly agglutinative language, it is important to find out the appropriate size for the training data. The minimum required data size is expected to be much higher than the data needed to train a language model for a language with a low level of agglutination such as English. In the experiments we also tried to adjust the Language Model Weight (LMW) and Active Token Count (ATC) parameters of the LM, as these are expected to be different for a highly agglutinative language. We showed that increasing the training data size to an appropriate level improved recognition accuracy; on the other hand, changes to LMW and ATC did not have a positive effect on Turkish speech recognition accuracy.
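
    The size effect can be probed with even a toy n-gram model: train on growing slices of text and watch test-set perplexity fall. The add-k-smoothed bigram below is a minimal stand-in for the paper's LM, not its actual recipe.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens, k=1.0):
    """Add-k-smoothed bigram LM; returns test-set perplexity.
    Assumes at least two test tokens; unseen words map to <unk>."""
    vocab = set(train_tokens) | {"<unk>"}
    uni = Counter(train_tokens)
    bi = Counter(zip(train_tokens, train_tokens[1:]))
    V = len(vocab)
    test = [w if w in vocab else "<unk>" for w in test_tokens]
    logp = 0.0
    for h, w in zip(test, test[1:]):
        p = (bi[(h, w)] + k) / (uni[h] + k * V)
        logp += math.log(p)
    return math.exp(-logp / (len(test) - 1))

# e.g. compare growing slices of a training corpus:
# for frac in (0.1, 0.5, 1.0):
#     n = int(frac * len(corpus))
#     print(frac, bigram_perplexity(corpus[:n], test))
```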

  2. Using Mutual Information Criterion to Design an Efficient Phoneme Set for Chinese Speech Recognition

    Science.gov (United States)

    Zhang, Jin-Song; Hu, Xin-Hui; Nakamura, Satoshi

    Chinese is a representative tonal language, and how to process tone information in a state-of-the-art large-vocabulary speech recognition system has been an attractive topic. This paper presents a novel way to derive an efficient phoneme set of tone-dependent units for building a recognition system, by iteratively merging a pair of tone-dependent units according to the principle of minimal loss of Mutual Information (MI). The mutual information is measured between the word tokens and their phoneme transcriptions in a training text corpus, based on the system lexicon and language model. The approach has the capability to keep discriminative tonal (and phoneme) contrasts that are most helpful for disambiguating homophone words arising from a lack of tones, and to merge those tonal (and phoneme) contrasts that are not important for word disambiguation in the recognition task. This enables a flexible selection of the phoneme set according to a balance between the MI amount and the number of phonemes. We applied the method to the traditional phoneme set of Initials/Finals and derived several phoneme sets with different numbers of units. Speech recognition experiments using the derived sets showed its effectiveness.
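
    The core quantity is easy to sketch: with a deterministic lexicon, I(W; T) reduces to the entropy of the transcriptions, so merging two units lowers MI exactly when it creates homophones. A toy version of the MI computation and a candidate merge (data structures and names are assumptions, not the paper's):

```python
import math
from collections import Counter

def mutual_information(lexicon, word_freq):
    """I(W; T) for a deterministic lexicon (word -> transcription tuple):
    equals H(T), the entropy of transcriptions under the word unigram
    distribution given by word_freq (word -> count)."""
    total = sum(word_freq.values())
    trans_freq = Counter()
    for w, t in lexicon.items():
        trans_freq[t] += word_freq[w]
    # sum_w p(w) * -log p(t_w), grouping homophones through trans_freq
    return sum((word_freq[w] / total) * -math.log(trans_freq[t] / total)
               for w, t in lexicon.items())

def merge_units(lexicon, a, b):
    """Relabel unit b as a in every transcription (a candidate merge)."""
    return {w: tuple(a if u == b else u for u in t)
            for w, t in lexicon.items()}

# Greedy step: score every candidate pair and keep the merge whose
# MI loss (original MI minus merged MI) is minimal.
```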

  3. Automatic recognition of lactating sow behaviors through depth image processing

    Science.gov (United States)

    Manual observation and classification of animal behaviors is laborious, time-consuming, and of limited ability to process large amount of data. A computer vision-based system was developed that automatically recognizes sow behaviors (lying, sitting, standing, kneeling, feeding, drinking, and shiftin...

  4. Automatic SIMD parallelization of embedded applications based on pattern recognition

    NARCIS (Netherlands)

    Manniesing, R.; Karkowski, I.P.; Corporaal, H.

    2000-01-01

    This paper investigates the potential for automatic mapping of typical embedded applications to architectures with multimedia instruction set extensions. For this purpose a (pattern matching based) code transformation engine is used, which involves a three-step process of matching, condition checkin

  5. Development of an automated speech recognition interface for personal emergency response systems

    Directory of Open Access Journals (Sweden)

    Mihailidis Alex

    2009-07-01

    Full Text Available Abstract Background Demands on long-term-care facilities are predicted to increase at an unprecedented rate as the baby boomer generation reaches retirement age. Aging-in-place (i.e. aging at home) is the desire of most seniors and is also a good option to reduce the burden on an over-stretched long-term-care system. Personal Emergency Response Systems (PERSs) help enable older adults to age-in-place by providing them with immediate access to emergency assistance. Traditionally they operate with push-button activators that connect the occupant via speaker-phone to a live emergency call-centre operator. If occupants do not wear the push button or cannot access the button, then the system is useless in the event of a fall or emergency. Additionally, a false alarm or failure to check-in at a regular interval will trigger a connection to a live operator, which can be unwanted and intrusive to the occupant. This paper describes the development and testing of an automated, hands-free, dialogue-based PERS prototype. Methods The prototype system was built using a ceiling mounted microphone array, an open-source automatic speech recognition engine, and a 'yes' and 'no' response dialog modelled after an existing call-centre protocol. Testing compared a single microphone versus a microphone array with nine adults in both noisy and quiet conditions. Dialogue testing was completed with four adults. Results and discussion The microphone array demonstrated improvement over the single microphone. In all cases, dialog testing resulted in the system reaching the correct decision about the kind of assistance the user was requesting. Further testing is required with elderly voices and under different noise conditions to ensure the appropriateness of the technology. Future developments include integration of the system with an emergency detection method as well as communication enhancement using features such as barge-in capability. Conclusion The use of an automated

  6. Automatic Eye Detection Error as a Predictor of Face Recognition Performance

    OpenAIRE

    Dutta, Abhishek; Veldhuis, Raymond; Spreeuwers, Luuk

    2014-01-01

    Various facial image quality parameters like pose, illumination, noise, resolution, etc are known to be a predictor of face recognition performance. However, there still remain many other properties of facial images that are not captured by the existing quality parameters. In this paper, we propose a novel image quality parameter called the Automatic Eye Detection Error (AEDE) which measures the difference between manually located and automatically detected eye coordinates. Our experiment res...

  7. Automatic Signature Verification: Bridging the Gap between Existing Pattern Recognition Methods and Forensic Science

    OpenAIRE

    Malik, Muhammad Imran

    2015-01-01

    The main goal of this thesis is twofold. First, the thesis aims at bridging the gap between existing Pattern Recognition (PR) methods of automatic signature verification and the requirements for their application in forensic science. This gap, attributed by various factors ranging from system definition to evaluation, prevents automatic methods from being used by Forensic Handwriting Examiners (FHEs). Second, the thesis presents novel signature verification methods developed particularly cons...

  8. Purging Musical Instrument Sample Databases Using Automatic Musical Instrument Recognition Methods

    OpenAIRE

    Livshin, Arie; Rodet, Xavier

    2009-01-01

    Compilation of musical instrument sample databases requires careful elimination of badly recorded samples and validation of sample classification into correct categories. This paper introduces algorithms for automatic removal of bad instrument samples using Automatic Musical Instrument Recognition and Outlier Detection techniques. Best evaluation results on a methodically contaminated sound database are achieved using the i...

  9. Recognition of Speech of Normal-hearing Individuals with Tinnitus and Hyperacusis

    Directory of Open Access Journals (Sweden)

    Hennig, Tais Regina

    2011-01-01

    Full Text Available Introduction: Tinnitus and hyperacusis are increasingly frequent audiological symptoms that may occur in the absence of hearing impairment, but this does not lessen their impact on, or the bother to, the affected individuals. The Medial Olivocochlear System helps in speech recognition in noise and may be connected to the presence of tinnitus and hyperacusis. Objective: To evaluate the speech recognition of normal-hearing individuals with and without complaints of tinnitus and hyperacusis, and to compare their results. Method: A descriptive, prospective and cross-sectional study in which 19 normal-hearing individuals with complaints of tinnitus and hyperacusis were evaluated as the Study Group (SG), and 23 normal-hearing individuals without audiological complaints as the Control Group (CG). The individuals of both groups were submitted to the List of Sentences in Portuguese test, prepared by Costa (1998), to determine the Sentence Recognition Threshold in Silence (LRSS) and the signal-to-noise ratio (S/N). The SG also answered the Tinnitus Handicap Inventory for tinnitus analysis, and discomfort thresholds were measured to characterize hyperacusis. Results: The CG and SG presented average LRSS and S/N ratios of 7.34 dB HL and -6.77 dB, and of 7.20 dB HL and -4.89 dB, respectively. Conclusion: The normal-hearing individuals with or without audiological complaints of tinnitus and hyperacusis had similar performance in speech recognition in silence, which was not the case when evaluated in the presence of competing noise, since the SG had poorer performance in this communication scenario, with a statistically significant difference.

  10. Automatic Instrument Recognition in Polyphonic Music Using Convolutional Neural Networks

    OpenAIRE

    Li, Peter; Qian, Jiyuan; Wang, Tian

    2015-01-01

    Traditional methods to tackle many music information retrieval tasks typically follow a two-step architecture: feature engineering followed by a simple learning algorithm. In these "shallow" architectures, feature engineering and learning are typically disjoint and unrelated. Additionally, feature engineering is difficult, and typically depends on extensive domain expertise. In this paper, we present an application of convolutional neural networks for the task of automatic musical instrument ...

  11. Image analysis in automatic system of pollen recognition

    Directory of Open Access Journals (Sweden)

    Piotr Rapiejko

    2012-12-01

    Full Text Available In allergology practice and research, it would be convenient to receive pollen identification and monitoring results in a much shorter time than human identification provides. Image-based analysis is one approach to an automated identification scheme for pollen grains, and pattern recognition on such images is widely used as a powerful tool. The goal of such an attempt is to provide accurate, fast recognition, classification and counting of pollen grains by a computer system for monitoring. The isolated pollen grains are objects extracted from microscopic images by a CCD camera and a PC under proper conditions for further analysis. The algorithms are based on knowledge from feature-vector analysis of estimated parameters calculated from grain characteristics, including morphological features, surface features and other applicable estimated characteristics. Segmentation algorithms specially tailored to pollen object characteristics provide exact descriptions of the pollen characteristics (border and internal features) already used by human experts. The specific characteristics and their measures are statistically estimated for each object. Some low-level statistics for the estimated local and global measures of the features establish the feature space. Special care should be paid to choosing these features and constructing the feature space so as to optimize the number of subspaces for higher recognition rates in low-level classification for type differentiation of pollen grains. The results of estimated parameters of the feature vector in a low-dimensional space for some typical pollen types are presented, as well as some effective and fast recognition results from experiments performed on different pollens. The findings show the evidence of using properly chosen estimators of central and invariant moments (M21, NM2, NM3, NM8, NM9) of tailored characteristics for good enough classification measures (efficiency > 95%), even for low-dimensional classifiers.

  12. Automatic target recognition in SAR images using multilinear analysis

    OpenAIRE

    Porgès, Tristan; Favier, Gérard

    2011-01-01

    Multilinear analysis provides a powerful mathematical framework for analyzing synthetic aperture radar (SAR) images resulting from the interaction of multiple factors like sky luminosity and viewing angles, while preserving their original shape. In this paper, we propose a multilinear principal component analysis (MPCA) algorithm for target recognition in SAR images. First, we form a high-order tensor with the training image set and we apply the higher-order singular...

  13. Dead regions in the cochlea: Implications for speech recognition and applicability of articulation index theory

    DEFF Research Database (Denmark)

    Vestergaard, Martin David

    2003-01-01

    Dead regions in the cochlea have been suggested to be responsible for failure by hearing aid users to benefit from apparently increased audibility in terms of speech intelligibility. As an alternative to the more cumbersome psychoacoustic tuning curve measurement, threshold-equalizing noise (TEN) has been reported to enable diagnosis of dead regions. The purpose of the present study was first to assess the feasibility of the TEN test protocol, and second, to assess the ability of the procedure to reveal related functional impairment. The latter was done by a test for the recognition of low-pass-filtered speech items. Data were collected from 22 hearing-impaired subjects with moderate-to-profound sensorineural hearing losses. The results showed that 11 subjects exhibited abnormal psychoacoustic behaviour in the TEN test, indicative of a possible dead region. Estimates of audibility were used to assess...

  14. Modeling speech imitation and ecological learning of auditory-motor maps

    OpenAIRE

    Claudia Canevari; Leonardo Badino; Alessandro D'Ausilio; Luciano Fadiga; Giorgio Metta

    2013-01-01

    Classical models of speech consider an antero-posterior distinction between perceptive and productive functions. However, the selective alteration of neural activity in speech motor centers, via transcranial magnetic stimulation, was shown to affect speech discrimination. On the automatic speech recognition (ASR) side, the recognition systems have classically relied solely on acoustic data, achieving rather good performance in optimal listening conditions. The main limitations of current ASR ...

  15. Commercial applications of speech interface technology: an industry at the threshold.

    OpenAIRE

    Oberteuffer, J A

    1995-01-01

    Speech interface technology, which includes automatic speech recognition, synthetic speech, and natural language processing, is beginning to have a significant impact on business and personal computer use. Today, powerful and inexpensive microprocessors and improved algorithms are driving commercial applications in computer command, consumer, data entry, speech-to-text, telephone, and voice verification. Robust speaker-independent recognition systems for command and navigation in personal com...

  16. Performance of Czech Speech Recognition with Language Models Created from Public Resources

    Directory of Open Access Journals (Sweden)

    V. Prochazka

    2011-12-01

    Full Text Available In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM made from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity rates) and from the performance point of view when employed in large-vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, OOV, and recognition accuracy rates.

  17. A Digital Liquid State Machine With Biologically Inspired Learning and Its Application to Speech Recognition.

    Science.gov (United States)

    Zhang, Yong; Li, Peng; Jin, Yingyezhe; Choe, Yoonsuck

    2015-11-01

    This paper presents a bioinspired digital liquid-state machine (LSM) for low-power very-large-scale-integration (VLSI)-based machine learning applications. To the best of the authors' knowledge, this is the first work that employs a bioinspired spike-based learning algorithm for the LSM. With the proposed online learning, the LSM extracts information from input patterns on the fly without needing intermediate data storage as required in offline learning methods such as ridge regression. The proposed learning rule is local such that each synaptic weight update is based only upon the firing activities of the corresponding presynaptic and postsynaptic neurons without incurring global communications across the neural network. Compared with the backpropagation-based learning, the locality of computation in the proposed approach lends itself to efficient parallel VLSI implementation. We use subsets of the TI46 speech corpus to benchmark the bioinspired digital LSM. To reduce the complexity of the spiking neural network model without performance degradation for speech recognition, we study the impacts of synaptic models on the fading memory of the reservoir and hence the network performance. Moreover, we examine the tradeoffs between synaptic weight resolution, reservoir size, and recognition performance and present techniques to further reduce the overhead of hardware implementation. Our simulation results show that in terms of isolated word recognition evaluated using the TI46 speech corpus, the proposed digital LSM rivals the state-of-the-art hidden Markov-model-based recognizer Sphinx-4 and outperforms all other reported recognizers including the ones that are based upon the LSM or neural networks. PMID:25643415

  18. Automatic target recognition in synthetic aperture sonar images for autonomous mine hunting

    NARCIS (Netherlands)

    Quesson, B.A.J.; Sabel, J.C.; Bouma, H.; Dekker, R.J.; Lengrand-Lambert, J.

    2010-01-01

    The future of Mine Countermeasures (MCM) operations lies with unmanned platforms where Automatic Target Recognition (ATR) is an essential step in making the mine hunting process autonomous. At TNO, a new ATR method is currently being developed for use on an Autonomous Underwater Vehicle (AUV), using

  19. INTEGRATING MACHINE TRANSLATION AND SPEECH SYNTHESIS COMPONENT FOR ENGLISH TO DRAVIDIAN LANGUAGE SPEECH TO SPEECH TRANSLATION SYSTEM

    Directory of Open Access Journals (Sweden)

    J. SANGEETHA

    2015-02-01

    Full Text Available This paper provides an interface between the machine translation and speech synthesis systems for converting English speech to Tamil text in an English-to-Tamil speech-to-speech translation system. The speech translation system consists of three modules: automatic speech recognition, machine translation, and text-to-speech synthesis. Many procedures for the integration of speech recognition and machine translation have been proposed, but the speech synthesis component has not yet received the same consideration. In this paper, we focus on the integration of machine translation and speech synthesis, and report a subjective evaluation investigating the impact of the speech synthesis and machine translation components and of their integration. Here we implement hybrid machine translation (a combination of rule-based and statistical machine translation) and a concatenative syllable-based speech synthesis technique. In order to retain the naturalness and intelligibility of the synthesized speech, Auto Associative Neural Network (AANN) prosody prediction is used in this work. The results of this investigation demonstrate that the naturalness and intelligibility of the synthesized speech are strongly influenced by the fluency and correctness of the translated text.

  20. Current trends in multilingual speech processing

    Indian Academy of Sciences (India)

    Hervé Bourlard; John Dines; Mathew Magimai-Doss; Philip N Garner; David Imseng; Petr Motlicek; Hui Liang; Lakshmi Saheer; Fabio Valente

    2011-10-01

    In this paper, we describe recent work at Idiap Research Institute in the domain of multilingual speech processing and provide some insights into emerging challenges for the research community. Multilingual speech processing has been a topic of ongoing interest to the research community for many years and the field is now receiving renewed interest owing to two strong driving forces. Firstly, technical advances in speech recognition and synthesis are posing new challenges and opportunities to researchers. For example, discriminative features are seeing wide application by the speech recognition community, but additional issues arise when using such features in a multilingual setting. Another example is the apparent convergence of speech recognition and speech synthesis technologies in the form of statistical parametric methodologies. This convergence enables the investigation of new approaches to unified modelling for automatic speech recognition and text-to-speech synthesis (TTS) as well as cross-lingual speaker adaptation for TTS. The second driving force is the impetus being provided by both government and industry for technologies to help break down domestic and international language barriers, these also being barriers to the expansion of policy and commerce. Speech-to-speech and speech-to-text translation are thus emerging as key technologies at the heart of which lies multilingual speech processing.

  1. Non-linear Spectral Contrast Stretching for In-car Speech Recognition

    OpenAIRE

    Li, Weifeng; Bourlard, Hervé

    2007-01-01

    In this paper, we present a novel feature normalization method in the log-scaled spectral domain for improving the noise robustness of speech recognition front-ends. In the proposed scheme, a non-linear contrast stretching is added to the outputs of log mel-filterbanks (MFB) to imitate the adaptation of the auditory system under adverse conditions. This is followed by a two-dimensional filter to smooth out the processing artifacts. The proposed MFCC front-ends perform remarkably well on CENSR...
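
    As a sketch of the general idea (the paper's exact stretching function and smoothing filter are not specified here), one might stretch the log-MFB outputs with a sigmoid around the utterance mean and then smooth with a small two-dimensional mean filter:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def contrast_stretch(log_mfb, gain=2.0):
    """Illustrative non-linear contrast stretching of (frames x bands)
    log-MFB outputs: a sigmoid expands values around the per-utterance
    mean (gain is a guess, not the paper's function), then a 3x3 mean
    filter smooths processing artifacts over time and frequency."""
    z = (log_mfb - log_mfb.mean()) / (log_mfb.std() + 1e-8)
    stretched = 1.0 / (1.0 + np.exp(-gain * z))
    return uniform_filter(stretched, size=3)
```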

  2. Cued Speech: A visual communication mode for the Deaf society

    OpenAIRE

    Heracleous, Panikos; Beautemps, Denis

    2010-01-01

    Cued Speech is a visual mode of communication that uses handshapes and placements in combination with the mouth movements of speech to make the phonemes of a spoken language look different from each other and clearly understandable to deaf individuals. The aim of Cued Speech is to overcome the problems of lip reading and thus enable deaf persons to wholly understand spoken language. In this study, automatic phoneme recognition in Cued Speech for French based on hidden Markov models (HMMs) is i...

  3. High-order hidden Markov model for piecewise linear processes and applications to speech recognition.

    Science.gov (United States)

    Lee, Lee-Min; Jean, Fu-Rong

    2016-08-01

    The hidden Markov models have been widely applied to systems with sequential data. However, the conditional independence of the state outputs will limit the output of a hidden Markov model to be a piecewise constant random sequence, which is not a good approximation for many real processes. In this paper, a high-order hidden Markov model for piecewise linear processes is proposed to better approximate the behavior of a real process. A parameter estimation method based on the expectation-maximization algorithm was derived for the proposed model. Experiments on speech recognition of noisy Mandarin digits were conducted to examine the effectiveness of the proposed method. Experimental results show that the proposed method can reduce the recognition error rate compared to a baseline hidden Markov model. PMID:27586781

  4. Text Detection and Recognition with Speech Output for Visually Challenged Person: A Review

    Directory of Open Access Journals (Sweden)

    Ms.Rupali D. Dharmale

    2015-03-01

    Full Text Available Reading text from scenes, images, and text boards is a demanding task for visually challenged persons. This task has been proposed to be carried out with the help of image processing. Image processing has long been helpful in the field of object recognition and is still an emerging area of research. The proposed system reads the text encountered in images and on text boards with the aim of supporting visually challenged persons. Text detection and recognition in natural scenes can provide valuable information for many applications. In this work, an approach has been attempted to extract and recognize text from scene images and convert the recognized text into speech. This can be an empowering force in a visually challenged person's life, relieving the frustration of not being able to read at will and thus enhancing quality of life.

  5. Automatic recognition of bone for x-ray bone densitometry

    Science.gov (United States)

    Shepp, Larry A.; Vardi, Y.; Lazewatsky, J.; Libeau, James; Stein, Jay A.

    1991-06-01

    We describe a method for automatically identifying and separating pixels representing bone from those representing soft tissue in a dual-energy point-scanned projection radiograph of the abdomen. In order to achieve stable quantitative measurement of projected bone mineral density, a calibration using sample bone in regions containing only soft tissue must be performed. In addition, the projected area of bone must be measured. We show that, using an image with realistically low noise, the histogram of pixel values exhibits a well-defined peak corresponding to the soft tissue region. A threshold at a fixed multiple of the calibration segment value readily separates bone from soft tissue in a wide variety of patient studies. Our technique, which is employed in the Hologic QDR-1000 Bone Densitometer, is rapid, robust, and significantly simpler than a conventional artificial intelligence approach using edge-detection to define objects and expert systems to recognize them.

  6. Automatic music genres classification as a pattern recognition problem

    Science.gov (United States)

    Ul Haq, Ihtisham; Khan, Fauzia; Sharif, Sana; Shaukat, Arsalan

    2013-12-01

    Music genres are the simplest and most effective descriptors for searching music libraries, stores, or catalogues. The paper compares the results of two automatic music genre classification systems implemented using two different yet simple classifiers (K-Nearest Neighbor and Naïve Bayes). First a 10-12 second sample is selected and features are extracted from it; then, based on those features, the results of both classifiers are represented in the form of an accuracy table and a confusion matrix. An experiment carried out on 60 test samples shows that a sample taken from the middle of a song represents the true essence of its genre better than samples taken from the beginning or end of the song. The techniques achieved accuracies of 91% and 78% using the Naïve Bayes and KNN classifiers, respectively.
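
    A skeleton of such a two-classifier comparison, using scikit-learn and random stand-in features (a real system would use per-clip audio features, e.g. MFCC statistics):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# X: one feature vector per 10-12 s clip, y: genre labels (5 classes).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.integers(0, 5, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

for clf in (KNeighborsClassifier(n_neighbors=5), GaussianNB()):
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(type(clf).__name__, accuracy_score(y_te, pred))
    print(confusion_matrix(y_te, pred))
```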

  7. Analysis of speech under stress using Linear techniques and Non-Linear techniques for emotion recognition system

    OpenAIRE

    A. A. Khulage; B. V. Pathak

    2012-01-01

    Analysis of speech for recognition of stress is important for identifying the emotional state of a person. This can be done using 'linear techniques', which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration, and MFCCs for extracting features from speech. TEO-CB-Auto-Env is a non-linear method of feature extraction. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here e...

  8. Unsupervised Topic Adaptation for Lecture Speech Retrieval

    OpenAIRE

    Fujii, Atsushi; Itou, Katunobu; Akiba, Tomoyosi; Ishikawa, Tetsuya

    2004-01-01

    We are developing a cross-media information retrieval system, in which users can view specific segments of lecture videos by submitting text queries. To produce a text index, the audio track is extracted from a lecture video and a transcription is generated by automatic speech recognition. In this paper, to improve the quality of our retrieval system, we extensively investigate the effects of adapting acoustic and language models on speech recognition. We perform an MLLR-based method to adapt...

  9. Automatic target recognition performance losses in the presence of atmospheric and camera effects

    Science.gov (United States)

    Chen, Xiaohan; Schmid, Natalia A.

    2010-04-01

    The importance of networked automatic target recognition systems for surveillance applications is continuously increasing. Because of the requirement of a low cost and limited payload, these networks are traditionally equipped with lightweight, low-cost sensors such as electro-optical (EO) or infrared sensors. The quality of imagery acquired by these sensors critically depends on the environmental conditions, type and characteristics of sensors, and absence of occluding or concealing objects. In the past, a large number of efficient detection, tracking, and recognition algorithms have been designed to operate on imagery of good quality. However, detection and recognition limits under nonideal environmental and/or sensor-based distortions have not been carefully evaluated. We introduce a fully automatic target recognition system that involves a Haar-based detector to select potential regions of interest within images, performs adjustment of detected regions, segments potential targets using a region-based approach, identifies targets using Bessel K form-based encoding, and performs clutter rejection. We investigate the effects of environmental and camera conditions on target detection and recognition performance. Two databases are involved. One is a simulated database generated using a 3-D tool. The other database is formed by imaging 10 die-cast models of military vehicles from different elevation and orientation angles. The database contains imagery acquired both indoors and outdoors. The indoors data set is composed of clear and distorted images. The distortions include defocus blur, sided illumination, low contrast, shadows, and occlusions. All images in this database, however, have a uniform (blue) background. The indoors database is applied to evaluate the degradations of recognition performance due to camera and illumination effects. The database collected outdoors includes a real background and is much more complex to process. The numerical results

  10. Compensating Acoustic Mismatch Using Class-Based Histogram Equalization for Robust Speech Recognition

    Science.gov (United States)

    Suh, Youngjoo; Kim, Sungtak; Kim, Hoirin

    2007-12-01

    A new class-based histogram equalization method is proposed for robust speech recognition. The proposed method aims at not only compensating for an acoustic mismatch between training and test environments but also reducing the two fundamental limitations of the conventional histogram equalization method, the discrepancy between the phonetic distributions of training and test speech data, and the nonmonotonic transformation caused by the acoustic mismatch. The algorithm employs multiple class-specific reference and test cumulative distribution functions, classifies noisy test features into their corresponding classes, and equalizes the features by using their corresponding class reference and test distributions. The minimum mean-square error log-spectral amplitude (MMSE-LSA)-based speech enhancement is added just prior to the baseline feature extraction to reduce the corruption by additive noise. The experiments on the Aurora2 database proved the effectiveness of the proposed method, which reduced relative errors over both the mel-cepstral-based features and the conventional histogram equalization method; the exact relative error reductions are given in the full text.
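
    The core of any histogram equalization variant is quantile mapping between a test distribution and a reference distribution; the class-based method applies it per class after classifying each noisy frame. A single-class sketch:

```python
import numpy as np

def histogram_equalize(test_feats, ref_feats):
    """Quantile mapping (the core of histogram equalization): each test
    value is mapped through the empirical test CDF and then through the
    inverse reference CDF. Operates on 1-D arrays of one feature
    dimension; the class-based method would select ref_feats per class."""
    ref_sorted = np.sort(ref_feats)
    ranks = np.argsort(np.argsort(test_feats))   # empirical test CDF ranks
    cdf = (ranks + 0.5) / len(test_feats)
    return np.interp(cdf, np.linspace(0, 1, len(ref_sorted)), ref_sorted)
```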

  11. A Rapid Model Adaptation Technique for Emotional Speech Recognition with Style Estimation Based on Multiple-Regression HMM

    Science.gov (United States)

    Ijima, Yusuke; Nose, Takashi; Tachibana, Makoto; Kobayashi, Takao

    In this paper, we propose a rapid model adaptation technique for emotional speech recognition which enables us to extract paralinguistic information as well as linguistic information contained in speech signals. This technique is based on style estimation and style adaptation using a multiple-regression HMM (MRHMM). In the MRHMM, the mean parameters of the output probability density function are controlled by a low-dimensional parameter vector, called a style vector, which corresponds to a set of the explanatory variables of the multiple regression. The recognition process consists of two stages. In the first stage, the style vector that represents the emotional expression category and the intensity of its expressiveness for the input speech is estimated on a sentence-by-sentence basis. Next, the acoustic models are adapted using the estimated style vector, and then standard HMM-based speech recognition is performed in the second stage. We assess the performance of the proposed technique in the recognition of simulated emotional speech uttered by both professional narrators and non-professional speakers.
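
    The MRHMM mean adaptation itself is compact: the state output mean is a multiple regression on the style vector, mu(s) = H [1, s^T]^T, where H is the regression matrix estimated in training. A sketch with invented dimensions:

```python
import numpy as np

def adapted_mean(H, style_vector):
    """MRHMM-style mean adaptation: regression of the state output mean
    on the augmented style vector [1, s]. H and s are illustrative."""
    xi = np.concatenate(([1.0], style_vector))  # augmented style vector
    return H @ xi

H = np.random.default_rng(0).normal(size=(39, 3))  # 39-dim features, 2-dim style
mu = adapted_mean(H, np.array([0.8, 0.2]))         # e.g. strong vs. mild expressiveness
```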

  12. Automatic Recognition Method for Optical Measuring Instruments Based on Machine Vision

    Institute of Scientific and Technical Information of China (English)

    SONG Le; LIN Yuchi; HAO Liguo

    2008-01-01

    Based on a comprehensive study of various algorithms, the automatic recognition of traditional ocular optical measuring instruments is realized. Taking a universal tools microscope (UTM) lens view image as an example, a 2-layer automatic recognition model for data reading is established after adopting a series of pre-processing algorithms. This model is an optimal combination of the correlation-based template matching method and a concurrent back propagation (BP) neural network. Multiple complementary feature extraction is used in generating the eigenvectors of the concurrent network. In order to improve fault-tolerance capacity, rotation invariant features based on Zernike moments are extracted from digit characters and a 4-dimensional group of the outline features is also obtained. Moreover, the operating time and reading accuracy can be adjusted dynamically by setting the threshold value. The experimental result indicates that the newly developed algorithm has optimal recognition precision and working speed. The average reading rate reaches 97.23%. The recognition method can automatically obtain the results of optical measuring instruments rapidly and stably without modifying their original structure, which meets the application requirements.

  13. Contribution to automatic image recognition applied to robot technology

    International Nuclear Information System (INIS)

    This paper describes a method for the analysis and interpretation of the images of objects located in a plain scene which is the environment of a robot. The first part covers the recovery of the contour of objects present in the image, and discusses a novel contour-following technique based on the line arborescence concept in combination with a 'cost function' giving a quantitative assessment of contour quality. We present heuristics for moderate-cost, minimum-time arborescence coverage, which is equivalent to following probable contour lines in the image. A contour segmentation technique, invariant in the translational and rotational modes, is presented next. The second part describes a recognition method based on the above invariant encoding: the algorithm performs a preliminary screening based on coarse data derived from segmentation, followed by a comparison of forms with probable identity through application of a distance specified in terms of the invariant encoding. The last part covers the outcome of the above investigations, which have found an industrial application in the vision system of a range of robots. The system is implemented on a 16-bit microprocessor and operates in real time. (author)

  14. Automatic target recognition on land using three-dimensional (3D) laser radar and artificial neural networks

    Directory of Open Access Journals (Sweden)

    Göztepe, K.

    2013-05-01

    Full Text Available During combat, measuring the dimensions of targets is extremely important for knowing when to fire on the enemy. The difficulty of identifying a known target on land underscores the importance of techniques devoted to automatic target recognition. Although a number of object-recognition techniques have been developed in the past, none of them has provided the desired specifics for unidentified target recognition. Studies on target recognition are largely based on images and assume that images of a known target can be readily viewed under any circumstance. But this is not true for military operations conducted on various terrains under specific circumstances. Usually it is not possible to capture images of unidentified objects because of weather, inadequate equipment, or concealment. In this study, a new approach that integrates neural networks and laser radar has been developed for automatic target recognition in order to reduce the above-mentioned problems. Unlike previous studies, the proposed model uses the geometric dimensions of unidentified targets in order to detect and recognise them under severe weather conditions.

  15. ANALYSIS OF SPEECH UNDER STRESS USING LINEAR TECHNIQUES AND NON-LINEAR TECHNIQUES FOR EMOTION RECOGNITION SYSTEM

    Directory of Open Access Journals (Sweden)

    A. A. Khulage

    2012-07-01

    Full Text Available Analysis of speech for the recognition of stress is important for identifying the emotional state of a person. This can be done using ‘linear techniques’, which use parameters such as pitch, vocal tract spectrum, formant frequencies, duration, MFCCs, etc. for the extraction of features from speech. TEO-CB-Auto-Env is a non-linear feature extraction method. Analysis is done using the TU-Berlin (Technical University of Berlin) German database. Here, emotion recognition is performed for different emotions: neutral, happy, disgust, sad, boredom and anger. Emotion recognition is used in lie detectors, in database access systems, and in the military for identifying soldiers' emotional state during war.
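
    A minimal sketch of extracting two of the 'linear technique' features named above (MFCCs and pitch) with librosa; the file name, sampling rate and pitch range are assumptions.

    ```python
    import librosa
    import numpy as np

    y, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file

    # MFCCs: 13 coefficients per frame, a standard "linear technique" feature
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Pitch (F0) track via the YIN estimator
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)

    # One fixed-length descriptor per utterance: per-coefficient means plus mean F0
    features = np.concatenate([mfcc.mean(axis=1), [np.nanmean(f0)]])
    print(features.shape)   # (14,)
    ```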

  16. Unobtrusive multimodal emotion detection in adaptive interfaces: speech and facial expressions

    NARCIS (Netherlands)

    Truong, K.P.; Leeuwen, D.A. van; Neerincx, M.A.

    2007-01-01

    Two unobtrusive modalities for automatic emotion recognition are discussed: speech and facial expressions. First, an overview is given of emotion recognition studies based on a combination of speech and facial expressions. We will identify difficulties concerning data collection, data fusion, system…

  17. Automatic 3D object recognition and reconstruction based on neuro-fuzzy modelling

    Science.gov (United States)

    Samadzadegan, Farhad; Azizi, Ali; Hahn, Michael; Lucas, Caro

    Three-dimensional object recognition and reconstruction (ORR) is a research area of major interest in computer vision and photogrammetry. Virtual cities, for example, are one of the exciting application fields of ORR which became very popular during the last decade. Natural and man-made objects in cities, such as trees and buildings, are complex structures, and the automatic recognition and reconstruction of these objects from digital aerial images, as well as from other data sources, is a big challenge. In this paper a novel approach for object recognition is presented based on neuro-fuzzy modelling. Structural, textural and spectral information is extracted and integrated in a fuzzy reasoning process. The learning capability of neural networks is introduced to the fuzzy recognition process by taking adaptable parameter sets into account, which leads to the neuro-fuzzy approach. Object reconstruction follows recognition seamlessly by using the recognition output and the descriptors which have been extracted for recognition. A first successful application of this new ORR approach is demonstrated for the three object classes 'buildings', 'cars' and 'trees' using aerial colour images of an urban area of the town of Engen in Germany.

  18. Morphological self-organizing feature map neural network with applications to automatic target recognition

    Institute of Scientific and Technical Information of China (English)

    Shijun Zhang; Zhongliang Jing; Jianxun Li

    2005-01-01

    The rotation-invariant feature of the target is obtained using the multi-direction feature extraction property of the steerable filter. Combining the morphological top-hat transform with the self-organizing feature map neural network, the adaptive topological region is selected. Using the erosion operation, shrinkage of the topological region is achieved. The steerable-filter-based morphological self-organizing feature map neural network is applied to automatic target recognition of binary standard patterns and real-world infrared sequence images. Compared with the Hamming network and morphological shared-weight networks, the proposed method achieves a higher correct recognition rate, robust adaptability, quicker training, and better generalization.
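
    The top-hat step is straightforward to illustrate. A minimal OpenCV sketch follows: a white top-hat keeps small bright structures against a smooth background, and an erosion shrinks the selected region, as the abstract describes; the kernel size, threshold and file name are illustrative assumptions.

    ```python
    import cv2
    import numpy as np

    img = cv2.imread("infrared_frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input

    # White top-hat: image minus its morphological opening. It keeps small bright
    # structures (candidate targets) and suppresses the smooth background.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    tophat = cv2.morphologyEx(img, cv2.MORPH_TOPHAT, kernel)

    # Shrink the selected topological region with an erosion, as the record describes
    region = cv2.erode((tophat > 40).astype(np.uint8), np.ones((3, 3), np.uint8))
    ```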

  19. Psychometric Functions for Shortened Administrations of a Speech Recognition Approach Using Tri-Word Presentations and Phonemic Scoring

    Science.gov (United States)

    Gelfand, Stanley A.; Gelfand, Jessica T.

    2012-01-01

    Method: Complete psychometric functions for phoneme and word recognition scores at 8 signal-to-noise ratios from -15 dB to 20 dB were generated for the first 10, 20, and 25, as well as all 50, three-word presentations of the Tri-Word or Computer Assisted Speech Recognition Assessment (CASRA) Test (Gelfand, 1998) based on the results of 12…

  20. Automatic Recognition of Fetal Facial Standard Plane in Ultrasound Image via Fisher Vector.

    Science.gov (United States)

    Lei, Baiying; Tan, Ee-Leng; Chen, Siping; Zhuo, Liu; Li, Shengli; Ni, Dong; Wang, Tianfu

    2015-01-01

    Acquisition of the standard plane is the prerequisite of biometric measurement and diagnosis during the ultrasound (US) examination. In this paper, a new algorithm is developed for the automatic recognition of the fetal facial standard planes (FFSPs) such as the axial, coronal, and sagittal planes. Specifically, densely sampled root scale invariant feature transform (RootSIFT) features are extracted and then encoded by Fisher vector (FV). The Fisher network with multi-layer design is also developed to extract spatial information to boost the classification performance. Finally, automatic recognition of the FFSPs is implemented by support vector machine (SVM) classifier based on the stochastic dual coordinate ascent (SDCA) algorithm. Experimental results using our dataset demonstrate that the proposed method achieves an accuracy of 93.27% and a mean average precision (mAP) of 99.19% in recognizing different FFSPs. Furthermore, the comparative analyses reveal the superiority of the proposed method based on FV over the traditional methods. PMID:25933215

  1. Face Prediction Model for an Automatic Age-invariant Face Recognition System

    OpenAIRE

    Yadav, Poonam

    2015-01-01

    Automated face recognition and identification softwares are becoming part of our daily life; it finds its abode not only with Facebook's auto photo tagging, Apple's iPhoto, Google's Picasa, Microsoft's Kinect, but also in Homeland Security Department's dedicated biometric face detection systems. Most of these automatic face identification systems fail where the effects of aging come into the picture. Little work exists in the literature on the subject of face prediction that accounts for agin...

  2. Pseudo-Zernike Based Multi-Pass Automatic Target Recognition From Multi-Channel SAR

    OpenAIRE

    Carmine CLEMENTE; Pallotta, Luca; Proudler, Ian; De Maio, Antonio; John J. Soraghan; Farina, Alfonso

    2014-01-01

    The capability to exploit multiple sources of information is of fundamental importance in a battlefield scenario. Information obtained from different sources, and separated in space and time, provide the opportunity to exploit diversities in order to mitigate uncertainty. For the specific challenge of Automatic Target Recognition (ATR) from radar platforms, both channel (e.g. polarization) and spatial diversity can provide useful information for such a specific and critical task. In this pape...

  3. Automatic Modulation Recognition Using Wavelet Transform and Neural Networks in Wireless Systems

    OpenAIRE

    Dayoub I; Hamouda W; Hassan K; Berbineau M

    2010-01-01

    Modulation type is one of the most important characteristics used in signal waveform identification. In this paper, an algorithm for automatic digital modulation recognition is proposed. The proposed algorithm is verified using higher-order statistical moments (HOM) of continuous wavelet transform (CWT) as a features set. A multilayer feed-forward neural network trained with resilient backpropagation learning algorithm is proposed as a classifier. The purpose is to discriminate among differe...

  4. Automatic facial feature extraction and expression recognition based on neural network

    OpenAIRE

    Khandait, S. P.; Thool, R. C.; Khandait, P. D.

    2012-01-01

    In this paper, an approach to automatic facial feature extraction from a still frontal posed image, and to the classification and recognition of facial expressions and hence of a person's emotion and mood, is presented. A feed-forward back propagation neural network is used as a classifier for classifying the expressions of a supplied face into seven basic categories: surprise, neutral, sad, disgust, fear, happy and angry. For face portion segmentation and localization, morphological image...

  5. Automatic recognition of circuit patterns on semiconductor wafers from multiple scanning electron microscope images

    International Nuclear Information System (INIS)

    A technique is proposed for high-precision, automatic recognition of circuit patterns on a semiconductor wafer from multiple scanning electron microscope (SEM) images. This technique uses multiple SEM images obtained by selective detection of secondary and backscattered electrons emitted from a wafer surface irradiated with primary electrons. It automatically detects circuit patterns in these images. The appearances of circuit patterns in SEM images vary widely depending on the structure, the material and the pattern layout. The proposed technique can cope with such a large variation in pattern appearance by adaptively selecting two recognition methods based on pattern structure and pattern density. Other information, such as the images to be processed and the contrast between pattern and non-pattern regions, is also utilized for recognition. The technique provides effective preprocessing for automating defect classification. It is expected to improve the efficacy of process monitoring and yield management in semiconductor device fabrication. Experimental results for five wafers (from which 421 circuit pattern images were obtained) demonstrate that the proposed technique can automatically recognize circuit patterns with an accuracy of 99.8%

  6. A Novel Algorithm for Acoustic and Visual Classifiers Decision Fusion in Audio-Visual Speech Recognition System

    Directory of Open Access Journals (Sweden)

    P.S. Sathidevi

    2010-03-01

    Full Text Available Audio-visual speech recognition (AVSR) using acoustic and visual signals of speech has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain the appropriate integration weight for the speech modalities, so that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized, reliability-ratio-based weight estimation scheme is demonstrated via single-speaker isolated-word recognition experiments on a vocabulary of mobile functions. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and over the baseline reliability-ratio-based AVSR system under various signal-to-noise ratio conditions.
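
    The fusion rule itself is simple to sketch. Below, modality scores are combined with a single integration weight, and the weight is chosen by a grid search over held-out data as a stand-in for the paper's genetic algorithm; the array shapes and the grid are assumptions.

    ```python
    import numpy as np

    def fuse(audio_loglik, visual_loglik, lam):
        """Weighted decision fusion: lam is the audio integration weight."""
        return lam * audio_loglik + (1.0 - lam) * visual_loglik

    def best_weight(audio_ll, visual_ll, labels, grid=np.linspace(0, 1, 101)):
        """Stand-in for the paper's GA: pick the weight maximising accuracy on
        held-out data. audio_ll/visual_ll: (n_samples, n_words) log-likelihoods."""
        accuracies = [(fuse(audio_ll, visual_ll, lam).argmax(axis=1) == labels).mean()
                      for lam in grid]
        return grid[int(np.argmax(accuracies))]
    ```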

  7. Computer-Mediated Input, Output and Feedback in the Development of L2 Word Recognition from Speech

    Science.gov (United States)

    Matthews, Joshua; Cheng, Junyu; O'Toole, John Mitchell

    2015-01-01

    This paper reports on the impact of computer-mediated input, output and feedback on the development of second language (L2) word recognition from speech (WRS). A quasi-experimental pre-test/treatment/post-test research design was used involving three intact tertiary level English as a Second Language (ESL) classes. Classes were either assigned to…

  8. Investigating an Application of Speech-to-Text Recognition: A Study on Visual Attention and Learning Behaviour

    Science.gov (United States)

    Huang, Y-M.; Liu, C-J.; Shadiev, Rustam; Shen, M-H.; Hwang, W-Y.

    2015-01-01

    One major drawback of previous research on speech-to-text recognition (STR) is that most findings showing the effectiveness of STR for learning were based upon subjective evidence. Very few studies have used eye-tracking techniques to investigate visual attention of students on STR-generated text. Furthermore, not much attention was paid to…

  9. Automatic speech recognizer based on the Spanish spoken in Valdivia, Chile

    Science.gov (United States)

    Sanchez, Maria L.; Poblete, Victor H.; Sommerhoff, Jorge

    2001-05-01

    The performance of an automatic speech recognizer is affected by the training process (speaker-dependent or speaker-independent) and the size of the vocabulary. The language used in this study was the Spanish spoken in the city of Valdivia, Chile. A representative sample of 14 students and six professionals, all natives of Valdivia (ten women and ten men), ranging in age between 20 and 30 years, was used to complete the study. Two systems were programmed based on the classical principles: digitizing, end-point detection, linear predictive coding, cepstral coefficients, dynamic time warping, and a final decision stage preceded by a training step: (i) speaker-dependent (15 words: five colors and ten numbers), (ii) speaker-independent (30 words: ten verbs, ten nouns, and ten adjectives). A simple didactic application, with options to choose colors, numbers and drawings of the verbs, nouns and adjectives, was designed to be used with a personal computer. In both programs, the tests carried out showed a tendency towards errors in short monosyllabic words like ``flor'' and ``sol.'' The best results were obtained with three-syllable words like ``disparar'' and ``mojado.'' [Work supported by Proyecto DID UACh N S-200278.]
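
    Dynamic time warping, the core of the decision stage described above, can be sketched in a few lines of numpy; the template-based recognizer below is a generic illustration, not the authors' code.

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping distance between two feature sequences
        a: (Ta, d), b: (Tb, d), e.g. frames of cepstral coefficients."""
        Ta, Tb = len(a), len(b)
        D = np.full((Ta + 1, Tb + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, Ta + 1):
            for j in range(1, Tb + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[Ta, Tb]

    def recognize(test_seq, templates):
        """Pick the training template (word -> feature sequence) with the
        smallest warped distance to the test utterance."""
        return min(templates, key=lambda w: dtw_distance(test_seq, templates[w]))
    ```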

  10. Rapid and automatic speech-specific learning mechanism in human neocortex.

    Science.gov (United States)

    Kimppa, Lilli; Kujala, Teija; Leminen, Alina; Vainio, Martti; Shtyrov, Yury

    2015-09-01

    A unique feature of human communication system is our ability to rapidly acquire new words and build large vocabularies. However, its neurobiological foundations remain largely unknown. In an electrophysiological study optimally designed to probe this rapid formation of new word memory circuits, we employed acoustically controlled novel word-forms incorporating native and non-native speech sounds, while manipulating the subjects' attention on the input. We found a robust index of neurolexical memory-trace formation: a rapid enhancement of the brain's activation elicited by novel words during a short (~30min) perceptual exposure, underpinned by fronto-temporal cortical networks, and, importantly, correlated with behavioural learning outcomes. Crucially, this neural memory trace build-up took place regardless of focused attention on the input or any pre-existing or learnt semantics. Furthermore, it was found only for stimuli with native-language phonology, but not for acoustically closely matching non-native words. These findings demonstrate a specialised cortical mechanism for rapid, automatic and phonology-dependent formation of neural word memory circuits. PMID:26074199

  11. Contribution to automatic handwritten characters recognition. Application to optical moving characters recognition

    International Nuclear Information System (INIS)

    This paper describes a research work on computer aided vision relating to the design of a vision system which can recognize isolated handwritten characters written on a mobile support. We use a technique which consists in analyzing information contained in the contours of the polygon circumscribed to the character's shape. These contours are segmented and labelled to give a new set of features constituted by: - right and left 'profiles', - topological and algebraic unvarying properties. A new method of character's recognition induced from this representation based on a multilevel hierarchical technique is then described. In the primary level, we use a fuzzy classification with dynamic programming technique using 'profiles'. The other levels adjust the recognition by using topological and algebraic unvarying properties. Several results are presented and an accuracy of 99 pc was reached for handwritten numeral characters, thereby attesting the robustness of our algorithm. (author)

  12. Comparative Study on Feature Selection and Fusion Schemes for Emotion Recognition from Speech

    Directory of Open Access Journals (Sweden)

    Santiago Planet

    2012-09-01

    Full Text Available The automatic analysis of speech to detect affective states may improve the way users interact with electronic devices. However, analysis at the acoustic level alone may not be enough to determine the emotion of a user in a realistic scenario. In this paper we analyzed the spontaneous speech recordings of the FAU Aibo Corpus at the acoustic and linguistic levels to extract two sets of features. The acoustic set was reduced by a greedy procedure selecting the most relevant features to optimize the learning stage. We compared two versions of this greedy selection algorithm, performing the search for relevant features forwards and backwards. We experimented with three classification approaches: Naïve-Bayes, a support vector machine and a logistic model tree, and two fusion schemes: decision-level fusion, merging the hard decisions of the acoustic and linguistic classifiers by means of a decision tree; and feature-level fusion, concatenating both sets of features before the learning stage. Despite the low performance achieved by the linguistic data alone, a dramatic improvement was achieved when it was combined with the acoustic information, improving on the results achieved by the acoustic modality on its own. The results achieved by the classifiers using the parameters merged at the feature level outperformed the classification results of the decision-level fusion scheme, despite the simplicity of the latter. Moreover, the extremely reduced set of acoustic features obtained by the greedy forward search selection algorithm improved the results provided by the full set.
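
    The greedy forward search can be sketched with scikit-learn cross-validation; the sketch below uses Naive Bayes (one of the three classifiers in the paper) as the wrapped learner, with an assumed feature cap and 5-fold CV.

    ```python
    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def greedy_forward_selection(X, y, max_features=20):
        """Greedy forward search: repeatedly add the feature whose inclusion
        most improves cross-validated accuracy (Naive Bayes, as in the paper)."""
        selected, remaining = [], list(range(X.shape[1]))
        best_score = 0.0
        while remaining and len(selected) < max_features:
            scores = [(cross_val_score(GaussianNB(), X[:, selected + [f]], y,
                                       cv=5).mean(), f) for f in remaining]
            score, feat = max(scores)
            if score <= best_score:      # stop when no feature helps any more
                break
            best_score = score
            selected.append(feat)
            remaining.remove(feat)
        return selected, best_score
    ```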

  13. Robust Speaker Recognition against Synthetic Speech

    Institute of Scientific and Technical Information of China (English)

    Lianwu Chen; Wu Guo; Lirong Dai

    2011-01-01

    With the development of hidden Markov model (HMM) based speech synthesis technology, it is easy for impostors to produce synthetic speech with a specific speaker's characteristics, which poses an enormous threat to existing speaker recognition systems. In this paper, the difference between natural speech and synthetic speech is investigated on the real cepstrum, and a speaker recognition system that is robust against synthetic speech is proposed. Experimental results demonstrate that the false accept rate (FAR) for synthetic speech drops from 99.2% in the existing speaker recognition system to zero in the proposed system, with the equal error rate (EER) for natural speech unchanged.
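
    The quantity analysed in the record, the real cepstrum of a speech frame, is computed as the inverse Fourier transform of the log magnitude spectrum. A minimal numpy sketch, with an assumed frame length and window:

    ```python
    import numpy as np

    def real_cepstrum(frame):
        """Real cepstrum of one windowed speech frame: c = IFFT(log|FFT(x)|).
        The record analyses the statistics of these coefficients to separate
        natural from HMM-synthesised speech."""
        spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
        log_mag = np.log(np.abs(spectrum) + 1e-12)   # small epsilon avoids log(0)
        return np.fft.irfft(log_mag, n=len(frame))

    frame = np.random.randn(512)          # placeholder for a 32 ms frame at 16 kHz
    print(real_cepstrum(frame)[:5])
    ```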

  14. Definition and automatic anatomy recognition of lymph node zones in the pelvis on CT images

    Science.gov (United States)

    Liu, Yu; Udupa, Jayaram K.; Odhner, Dewey; Tong, Yubing; Guo, Shuxu; Attor, Rosemary; Reinicke, Danica; Torigian, Drew A.

    2016-03-01

    Currently, unlike the IASLC-defined thoracic lymph node zones, no explicitly provided definitions for lymph nodes in other body regions are available. Yet, definitions are critical for standardizing the recognition, delineation, quantification, and reporting of lymphadenopathy in other body regions. Continuing from our previous work in the thorax, this paper proposes a standardized definition of the grouping of pelvic lymph nodes into 10 zones. We subsequently employ our earlier Automatic Anatomy Recognition (AAR) framework, designed for body-wide organ modeling, recognition, and delineation, to implement these zonal definitions, where the zones are treated as anatomic objects. First, all 10 zones and the key anatomic organs used as anchors are manually delineated under expert supervision for constructing fuzzy anatomy models of the assembly of organs together with the zones. Then, an optimal hierarchical arrangement of these objects is constructed for the purpose of achieving the best zonal recognition. For actual localization of the objects, two strategies are used -- optimal thresholded search for the organs and a one-shot method for the zones, where the known relationship of the zones to key organs is exploited. Based on 50 computed tomography (CT) image data sets for the pelvic body region and an equal division into training and test subsets, automatic zonal localization within 1-3 voxels is achieved.

  15. Wifi-based Indoor Navigation with Mobile GIS and Speech Recognition

    Directory of Open Access Journals (Sweden)

    Jiangfan Feng

    2012-11-01

    Full Text Available With the development of mobile communications and wireless networking technology, and people's increasing demand for wireless location services, positioning services are becoming more and more important and have become a focus of research. The most common technology supporting outdoor positioning services is the Global Positioning System (GPS), but GPS is subject to restrictions in complex environments. WLAN (wireless local area network) technology has matured gradually in recent years, so Wi-Fi-based positioning can assist GPS. In this paper, we describe two positioning techniques based on WLAN: triangulation and fingerprinting. Additionally, a method combining speech recognition with a spatial data model is presented. The experimental result is encouraging and indicates that the proposed approach is effective.
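
    The fingerprinting technique mentioned above reduces to nearest-neighbour search in received-signal-strength space. A minimal sketch with scikit-learn follows; the radio map, access-point count and neighbour count are invented illustrative values.

    ```python
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical radio map: RSSI (dBm) from 3 access points at surveyed (x, y) spots
    rssi_map = np.array([[-40, -70, -80],
                         [-55, -50, -75],
                         [-70, -60, -45],
                         [-65, -45, -60]])
    positions = np.array([[0.0, 0.0], [5.0, 0.0], [10.0, 5.0], [5.0, 5.0]])

    # Fingerprinting = nearest neighbours in signal space, averaged in position space
    knn = KNeighborsRegressor(n_neighbors=2).fit(rssi_map, positions)
    print(knn.predict([[-58, -52, -70]]))   # estimated (x, y) of the user
    ```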

  16. Improving the Syllable-Synchronous Network Search Algorithm for Word Decoding in Continuous Chinese Speech Recognition

    Institute of Scientific and Technical Information of China (English)

    Fang Zheng; Jian Wu; Zhanjiang Song

    2000-01-01

    The previously proposed syllable-synchronous network search (SSNS) algorithm plays a very important role in the word decoding of continuous Chinese speech recognition and achieves satisfying performance. Several related key factors that may affect the overall word decoding effect are carefully studied in this paper, including the perfecting of the vocabulary, big-discount Turing re-estimation of the N-gram probabilities, and the management of the search path buffers. Based on these discussions, corresponding approaches to improving the SSNS algorithm are proposed. Compared with the previous version of the SSNS algorithm, the new version decreases the Chinese character error rate (CCER) in word decoding by 42.1% on a database consisting of a large number of test sentences (syllable strings).

  17. Data Collection in Zooarchaeology: Incorporating Touch-Screen, Speech-Recognition, Barcodes, and GIS

    Directory of Open Access Journals (Sweden)

    W. Flint Dibble

    2015-12-01

    Full Text Available When recording observations on specimens, zooarchaeologists typically use a pen and paper or a keyboard. However, the use of awkward terms and identification codes when recording thousands of specimens makes such data entry prone to human transcription errors. Improving the quantity and quality of the zooarchaeological data we collect can lead to more robust results and new research avenues. This paper presents design tools for building a customized zooarchaeological database that leverages accessible and affordable 21st-century technologies. Scholars interested in investing time in designing a custom database in common software (here, Microsoft Access) can take advantage of the affordable touch-screen, speech-recognition, and geographic information system (GIS) technologies described here. The efficiency that these approaches offer a research project far exceeds the time commitment a scholar must invest to deploy them.

  18. Hindi Digits Recognition System on Speech Data Collected in Different Natural Noise Environments

    Directory of Open Access Journals (Sweden)

    Babita Saxena

    2015-02-01

    Full Text Available This paper presents a baseline digit speech recognizer for the Hindi language. The recording environment differs across speakers, since the data were collected in their respective homes. The different environments include vehicle horn noise in some road-facing rooms, internal background noise in some rooms (such as opening doors), silence in others, etc. All these recordings are used for training the acoustic model. The acoustic model is trained on 8 speakers' audio data. The vocabulary size of the recognizer is 10 words. The HTK toolkit is used for building the acoustic model and evaluating the recognition rate of the recognizer. The efficiency of the recognizer developed on the recorded data is shown at the end of the paper, and possible directions for future research work are suggested.

  19. Classifier Subset Selection for the Stacked Generalization Method Applied to Emotion Recognition in Speech

    Directory of Open Access Journals (Sweden)

    Aitor Álvarez

    2015-12-01

    Full Text Available In this paper, a new supervised classification paradigm, called classifier subset selection for stacked generalization (CSS stacking), is presented to deal with speech emotion recognition. The new approach consists of an improvement of a bi-level multi-classifier system known as stacked generalization by means of the integration of an estimation of distribution algorithm (EDA) in the first layer to select the optimal subset from the standard base classifiers. The good performance of the proposed new paradigm was demonstrated over different configurations and datasets. First, several CSS stacking classifiers were constructed on the RekEmozio dataset, using some specific standard base classifiers and a total of 123 spectral, quality and prosodic features computed using in-house feature extraction algorithms. These initial CSS stacking classifiers were compared to other multi-classifier systems and to the employed standard classifiers built on the same set of speech features. Then, new CSS stacking classifiers were built on RekEmozio using a different set of both acoustic parameters (the extended version of the Geneva Minimalistic Acoustic Parameter Set, eGeMAPS) and standard classifiers, employing the best meta-classifier of the initial experiments. The performance of these two CSS stacking classifiers was evaluated and compared. Finally, the new paradigm was tested on the well-known Berlin Emotional Speech database. We compared the performance of single, standard stacking and CSS stacking systems using the same parametrization in the second phase. All of the classifications were performed at the categorical level, including the six primary emotions plus the neutral one.
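
    Plain stacked generalization, the bi-level system that CSS stacking improves on, can be sketched with scikit-learn; the EDA-based subset selection of the paper is omitted, and the particular base and meta classifiers below are illustrative choices.

    ```python
    from sklearn.ensemble import StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # First layer: standard base classifiers; second layer: a meta-classifier that
    # learns from their predictions. CSS stacking would additionally prune this
    # base-classifier list with an estimation of distribution algorithm.
    stack = StackingClassifier(
        estimators=[("svm", SVC(probability=True)),
                    ("tree", DecisionTreeClassifier()),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5)
    # stack.fit(X_train, y_train); stack.predict(X_test)  # X: acoustic feature vectors
    ```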

  20. Speech Based Question Recognition of Interactive Ubiquitous Teaching Robot using Supervised Classifier

    Directory of Open Access Journals (Sweden)

    Umarani, S. D.

    2011-06-01

    Full Text Available This paper presents research aimed at designing a speech-based interactive ubiquitous teaching robot for quality enhancement of higher education, ensuring better performance by enabling interaction between the robot teacher and the student/learner. Students/learners can ask technical questions, to which the robot replies; the replies are limited to the answers available in the database. The proposed speech-based interactive teaching system has two phases: training and testing. A feed-forward neural network is used as a supervised classifier after it is trained by means of a back propagation (BP) algorithm using the features of the sound signal from the student/learner. The number of database queries and non-database queries recognized is presented by a receiver operating characteristic (ROC) curve. The ROC curve lies above the diagonal of the ROC plot, which indicates that the proposed method of interaction between the robot and learners is acceptable. The percentage of questions recognized was 74%.

  1. Influence of tinnitus percentage index of speech recognition in patients with normal hearing

    Directory of Open Access Journals (Sweden)

    Urnau, Daila

    2010-12-01

    Full Text Available Introduction: The understanding of speech is one of the most important measurable aspects of human auditory function. Tinnitus affects quality of life, impairing communication. Objective: To investigate possible changes in the Percentage Index of Speech Recognition (SDT) in individuals with tinnitus who have normal hearing, and to examine the relationship between tinnitus, gender and age. Methods: A retrospective study analyzing the records of 82 individuals of both genders, aged 21-70 years, totaling 128 ears with normal hearing. The ears were analyzed separately and divided into a control group, without complaints of tinnitus, and a study group, with complaints of tinnitus. The influence of the variables gender and age group, and of tinnitus, on the SDT was examined. A score of 100% correct was considered normal, and values between 88% and 96% were considered altered; these criteria were adopted because percentages below 88% correct are found in individuals with sensorineural hearing loss. Results: There was no statistically significant association between age and tinnitus or between tinnitus and SDT, only between gender and tinnitus. There was a prevalence of tinnitus in females (56%), a higher incidence of tinnitus in the 31-40 age group (41.67%), a lower incidence in the 41-50 age group (18.75%), and a greater percentage of altered SDT in individuals with tinnitus (61.11%). Conclusion: Tinnitus does not interfere with the SDT, and there is no relationship between tinnitus and age, only between tinnitus and gender.

  2. Silent Speech Recognition with Arabic and English Words for Vocally Disabled Persons

    Directory of Open Access Journals (Sweden)

    Sami Nassimi

    2014-05-01

    Full Text Available This paper presents the results of our research in silent speech recognition (SSR) using surface electromyography (sEMG), the technology of recording the electrical activation potentials of the human articulatory muscles with surface electrodes in order to recognize speech. Though SSR is still in the experimental stage, a number of potential applications seem evident. Persons who have undergone a laryngectomy, or older people for whom speaking requires a substantial effort, would be able to mouth words (articulate them silently) rather than actually pronouncing them. Our system was trained with 30 utterances from each of three subjects on a test vocabulary of 4 phrases, and then tested on 15 new utterances that were not part of the training list. The system achieved an average of 91.11% word accuracy using a support vector machine (SVM) classifier when the base language was English, and an average of 89.44% word accuracy using Standard Arabic.

  3. A comparison of speech recognition training programs for cochlear implant users: A simulation study

    Science.gov (United States)

    McCabe, Marie E.; Chiu, C.-Y. Peter

    2003-10-01

    The present simulation study compared two training programs with very different design features to explore how each might improve the ability of listeners with normal hearing to recognize speech generated by a cochlear implant simulator. The first program, which focused training on specific areas of difficulty for individual patients across multiple levels of linguistic content (e.g., vowels, consonants, words, and sentences), was modeled after a standard program prescribed by one of the US manufacturers of cochlear implants. The second program consisted of exposure to multiple sentences with feedback regardless of subjects' performance level, and had been used in previous studies from this laboratory. All speech materials were reduced spectrally to simulate an 8-channel CIS cochlear implant processor with a ``6mm frequency upshift'' [Fu and Shannon, J. Acoust. Soc. Am. 105, 1889 (1999)]. Test sessions were administered to all subjects to assess recognition of sentences, consonants (/aCa/), and vowels (in /hVd/ and /bVt/ contexts) pre- and post-training. In a subset of subjects, a crossover design, in which subjects were trained first with one program and then with the other, was employed. Results will be discussed both in terms of theory and practice of therapeutic programs for cochlear implant users.

  4. Strategies for distant speech recognition in reverberant environments

    Science.gov (United States)

    Delcroix, Marc; Yoshioka, Takuya; Ogawa, Atsunori; Kubo, Yotaro; Fujimoto, Masakiyo; Ito, Nobutaka; Kinoshita, Keisuke; Espi, Miquel; Araki, Shoko; Hori, Takaaki; Nakatani, Tomohiro

    2015-12-01

    Reverberation and noise are known to severely affect the automatic speech recognition (ASR) performance of speech recorded by distant microphones. Therefore, we must deal with reverberation if we are to realize high-performance hands-free speech recognition. In this paper, we review a recognition system that we developed at our laboratory to deal with reverberant speech. The system consists of a speech enhancement (SE) front-end that employs long-term linear prediction-based dereverberation followed by noise reduction. We combine our SE front-end with an ASR back-end that uses neural networks for acoustic and language modeling. The proposed system achieved top scores on the ASR task of the REVERB challenge. This paper describes the different technologies used in our system and presents detailed experimental results that justify our implementation choices and may provide hints for designing distant ASR systems.

  5. Automatic translation among spoken languages

    Science.gov (United States)

    Walter, Sharon M.; Costigan, Kelly

    1994-02-01

    The Machine Aided Voice Translation (MAVT) system was developed in response to the shortage of experienced military field interrogators with both foreign language proficiency and interrogation skills. Combining speech recognition, machine translation, and speech generation technologies, the MAVT accepts an interrogator's spoken English question and translates it into spoken Spanish. The spoken Spanish response of the potential informant can then be translated into spoken English. Potential military and civilian applications for automatic spoken language translation technology are discussed in this paper.

  6. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Science.gov (United States)

    Muthusamy, Hariharan; Polat, Kemal; Yaacob, Sazali

    2015-01-01

    In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature. PMID:25799141

  7. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals.

    Directory of Open Access Journals (Sweden)

    Hariharan Muthusamy

    Full Text Available In recent years, many research works have been published using speech-related features for speech emotion recognition; however, recent studies show that there is a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from the emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discerning ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were utilized to gauge the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature.

  8. On the Use of Evolutionary Algorithms to Improve the Robustness of Continuous Speech Recognition Systems in Adverse Conditions

    Directory of Open Access Journals (Sweden)

    Sid-Ahmed Selouani

    2003-07-01

    Full Text Available Limiting the decrease in performance due to acoustic environment changes remains a major challenge for continuous speech recognition (CSR) systems. We propose a novel approach which combines the Karhunen-Loève transform (KLT) in the mel-frequency domain with a genetic algorithm (GA) to enhance the data representing corrupted speech. The idea consists of projecting noisy speech parameters onto the space generated by the genetically optimized principal axes issued from the KLT. The enhanced parameters increase the recognition rate in highly interfering noise environments. The proposed hybrid technique, when included in the front-end of an HTK-based CSR system, outperforms the conventional recognition process in severe interfering car noise environments for a wide range of signal-to-noise ratios (SNRs) varying from 16 dB to −4 dB. We also show the effectiveness of the KLT-GA method in recognizing speech subject to telephone channel degradations.
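
    The KLT projection at the heart of the method can be sketched in numpy: estimate principal axes from clean training features and project noisy features onto the leading ones. The genetic optimization of the axes is omitted, and the number of retained axes is an assumption.

    ```python
    import numpy as np

    def klt_enhance(noisy_feats, clean_feats, k=12):
        """Project noisy features (n, d) onto the k leading KLT (principal) axes
        estimated from clean training features. The paper additionally tunes
        these axes with a genetic algorithm, which is omitted here."""
        mean = clean_feats.mean(axis=0)
        cov = np.cov(clean_feats - mean, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        basis = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # top-k axes, (d, k)
        centered = noisy_feats - mean
        return centered @ basis @ basis.T + mean            # project and reconstruct
    ```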

  9. Long-term outcomes on spatial hearing, speech recognition and receptive vocabulary after sequential bilateral cochlear implantation in children.

    Science.gov (United States)

    Sparreboom, Marloes; Langereis, Margreet C; Snik, Ad F M; Mylanus, Emmanuel A M

    2014-11-01

    Sequential bilateral cochlear implantation in profoundly deaf children often leads to primary advantages in spatial hearing and speech recognition. It is not yet known how these children develop in the long term and whether these primary advantages will also lead to secondary advantages, e.g. better language skills. The aim of the present longitudinal cohort study was to assess the long-term effects of sequential bilateral cochlear implantation in children on spatial hearing, speech recognition in quiet and in noise, and receptive vocabulary. Twenty-four children with bilateral cochlear implants (BiCIs) were tested 5-6 years after sequential bilateral cochlear implantation. These children received their second implant between 2.4 and 8.5 years of age. Speech and language data were also gathered in a matched reference group of 26 children with a unilateral cochlear implant (UCI). Spatial hearing was assessed with a minimum audible angle (MAA) task with different stimulus types to gain global insight into the effective use of interaural level difference (ILD) and interaural timing difference (ITD) cues. In the long term, children still showed improvements in spatial acuity. Spatial acuity was highest for ILD cues compared to ITD cues. For speech recognition in quiet and noise, and receptive vocabulary, children with BiCIs had significantly higher scores than children with a UCI. Results also indicate that attending a mainstream school has a significant positive effect on speech recognition and receptive vocabulary compared to attending a school for the deaf. Despite a period of unilateral deafness, children with BiCIs participating in mainstream education obtained age-appropriate language scores. PMID:25462493

  10. AUTOMATIC RECOGNITION OF PIPING SYSTEM FROM LARGE-SCALE TERRESTRIAL LASER SCAN DATA

    Directory of Open Access Journals (Sweden)

    K. Kawashima

    2012-09-01

    Full Text Available Recently, changes in plant equipment have become more frequent because of the short lifetime of products, and constructing 3D shape models of existing plants (as-built models) from large-scale laser-scanned data is expected to make rebuilding processes more efficient. However, laser-scanned data of an existing plant consist of massive numbers of points, capture tangled objects, and include a large amount of noise, so that manual reconstruction of a 3D model is very time-consuming and costly. Piping systems, in particular, account for the greatest proportion of plant equipment. The purpose of this research was therefore to propose an algorithm which can automatically recognize a piping system from terrestrial laser scan data of plant equipment. The straight portions of pipes, the connecting parts, and the connection relationships of the piping system can be recognized by this algorithm. Eigenvalue analysis of the point clouds and of the normal vectors allows for this recognition. Using only point clouds, the recognition algorithm can be applied to registered point clouds and can be performed in a fully automatic way. Preliminary recognition results for large-scale scanned data from an oil rig plant have shown the effectiveness of the algorithm.
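
    The eigenvalue analysis mentioned above can be illustrated on a local point neighbourhood: for points concentrated along a straight pipe run, one covariance eigenvalue dominates. A minimal numpy sketch follows; the synthetic data and the linearity measure are illustrative, and the normal-vector analysis is omitted.

    ```python
    import numpy as np

    def linearity(neighborhood):
        """Eigenvalue analysis of a local point neighbourhood (n, 3).
        For points sampled along a straight pipe run, one eigenvalue of the
        covariance dominates; the ratio below is close to 1 for line-like sets."""
        centered = neighborhood - neighborhood.mean(axis=0)
        eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))  # ascending
        l1, l2, l3 = eigvals[::-1]
        return (l1 - l2) / l1

    # Points scattered tightly around a straight segment score near 1
    t = np.linspace(0, 1, 200)[:, None]
    pipe_axis = t * np.array([[1.0, 2.0, 0.5]]) + 0.01 * np.random.randn(200, 3)
    print(linearity(pipe_axis))
    ```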

  11. A new method for extraction of speech features using spectral delta characteristics and invariant integration

    OpenAIRE

    FARSI, Hassan; KUHIMOGHADAM, Samana

    2014-01-01

    We propose a new feature extraction algorithm that is robust against noise. Nonlinear filtering and temporal masking are used in the proposed algorithm. Since current automatic speech recognition systems use invariant-integration and delta-delta techniques for speech feature extraction, the proposed algorithm improves speech recognition accuracy by using a delta-spectral feature instead of invariant integration. One of the nonenvironmental factors that reduce recognitio...

  12. Arabic natural language processing: handwriting recognition

    OpenAIRE

    Belaïd, Abdel

    2008-01-01

    The automatic recognition of Arabic writing is a very young research discipline with very challenging and significant problems. Indeed, with the advent of the Internet and multimedia, the recognition of Arabic can contribute, like its close disciplines of Latin script recognition, speech recognition and vision processing, to current applications around digital libraries, document security, and numerical data processing in general. Arabic is a Semitic language spoken and understood in ...

  13. Assessing the impact of graphical quality on automatic text recognition in digital maps

    Science.gov (United States)

    Chiang, Yao-Yi; Leyk, Stefan; Honarvar Nazari, Narges; Moghaddam, Sima; Tan, Tian Xiang

    2016-08-01

    Converting geographic features (e.g., place names) in map images into a vector format is the first step for incorporating cartographic information into a geographic information system (GIS). With the advancement in computational power and algorithm design, map processing systems have been considerably improved over the last decade. However, the fundamental map processing techniques such as color image segmentation, (map) layer separation, and object recognition are sensitive to minor variations in graphical properties of the input image (e.g., scanning resolution). As a result, most map processing results would not meet user expectations if the user does not "properly" scan the map of interest, pre-process the map image (e.g., using compression or not), and train the processing system, accordingly. These issues could slow down the further advancement of map processing techniques as such unsuccessful attempts create a discouraged user community, and less sophisticated tools would be perceived as more viable solutions. Thus, it is important to understand what kinds of maps are suitable for automatic map processing and what types of results and process-related errors can be expected. In this paper, we shed light on these questions by using a typical map processing task, text recognition, to discuss a number of map instances that vary in suitability for automatic processing. We also present an extensive experiment on a diverse set of scanned historical maps to provide measures of baseline performance of a standard text recognition tool under varying map conditions (graphical quality) and text representations (that can vary even within the same map sheet). Our experimental results help the user understand what to expect when a fully or semi-automatic map processing system is used to process a scanned map with certain (varying) graphical properties and complexities in map content.

  14. Automatic track recognition for large-angle minimum ionizing particles in nuclear emulsions

    CERN Document Server

    Fukuda, T; Ishida, H; Matsumoto, T; Matsuo, T; Mikado, S; Nishimura, S; Ogawa, S; Shibuya, H; Sudou, J; Ariga, A; Tufanli, S

    2014-01-01

    We previously developed an automatic track scanning system which enables the detection of large-angle nuclear fragments in the nuclear emulsion films of the OPERA experiment. As a next step, we have investigated this system's track recognition capability for large-angle minimum ionizing particles $(1.0 \leq |\tan\theta| \leq 3.5)$. This paper shows that, for such tracks, the system has a detection efficiency of 95$\%$ or higher and reports the achieved angular accuracy of the automatically recognized tracks. This technology is of general purpose and will likely contribute not only to various analyses in the OPERA experiment, but also to future experiments, e.g. on low-energy neutrino and hadron interactions, or to future research on cosmic rays using nuclear emulsions carried by balloons.

  15. Automatic track recognition for large-angle minimum ionizing particles in nuclear emulsions

    International Nuclear Information System (INIS)

    We previously developed an automatic track scanning system which enables the detection of large-angle nuclear fragments in the nuclear emulsion films of the OPERA experiment. As a next step, we have investigated this system's track recognition capability for large-angle minimum ionizing particles (1.0 ≤ |tan θ| ≤ 3.5). This paper shows that, for such tracks, the system has a detection efficiency of 95% or higher and reports the achieved angular accuracy of the automatically recognized tracks. This technology is of general purpose and will likely contribute not only to various analyses in the OPERA experiment, but also to future experiments, e.g. on low-energy neutrino and hadron interactions, or to future research on cosmic rays using nuclear emulsions carried by balloons

  16. Segmentation of the speech signal based on changes in energy distribution in the spectrum

    Science.gov (United States)

    Jassem, W.; Kudzdela, H.; Domagala, P.

    1983-08-01

    A simple algorithm is proposed for automatic phonetic segmentation of the acoustic speech signal on the MERA 303 desk-top minicomputer. The algorithm is verified with Polish linguistic material spoken by two subjects. The proposed algorithm detects approximately 80 percent of the boundaries between enunciated segments correctly, a result no worse than that obtained using more complex methods. Speech recognition programs are discussed as speech perception models, and the nature of categorical perception of human speech sounds is examined.
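
    In the spirit of the algorithm described (boundaries placed where the distribution of spectral energy changes), here is a hedged numpy sketch; the frame length, hop, band count and threshold are illustrative assumptions, not the parameters of the original MERA 303 implementation.

    ```python
    import numpy as np

    def band_energy_boundaries(signal, sr, frame_len=400, hop=160,
                               n_bands=8, threshold=0.35):
        """Flag frame transitions where the normalised distribution of spectral
        energy across bands changes sharply -- candidate phonetic boundaries.
        Frame/band counts and the threshold are illustrative choices."""
        frames = [signal[i:i + frame_len] * np.hamming(frame_len)
                  for i in range(0, len(signal) - frame_len, hop)]
        spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
        bands = np.array_split(spectra, n_bands, axis=1)
        dist = np.stack([b.sum(axis=1) for b in bands], axis=1)
        dist /= dist.sum(axis=1, keepdims=True) + 1e-12     # energy distribution
        change = np.abs(np.diff(dist, axis=0)).sum(axis=1)  # L1 change per hop
        return np.where(change > threshold)[0] * hop / sr   # boundary times (s)
    ```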

  17. Automatic Recognition of Sunspots in HSOS Full-Disk Solar Images

    CERN Document Server

    Zhao, Cui; Deng, YuanYong; Yang, Xiao

    2016-01-01

    A procedure is introduced to recognise sunspots automatically in solar full-disk photosphere images obtained from the Huairou Solar Observing Station, National Astronomical Observatories of China. The images are first pre-processed with a Gaussian algorithm. Sunspots are then recognised by the morphological bottom-hat operation and Otsu thresholding. Wrong selections of sunspots are eliminated by a criterion on sunspot properties. In addition, in order to calculate the sunspot areas and the solar centre, the solar limb is extracted by a procedure using morphological closing and erosion operations and setting an adaptive threshold. The results of sunspot recognition reveal that the number of sunspots detected by our procedure is in quite good agreement with the manual method. The sunspot recognition rate is 95% and the error rate is 1.2%. The sunspot areas calculated by our method have a high correlation (95%) with the area data from USAF/NOAA.

  18. Application of pattern recognition in molecular spectroscopy: Automatic line search in high-resolution spectra

    Science.gov (United States)

    Bykov, A. D.; Pshenichnikov, A. M.; Sinitsa, L. N.; Shcherbakov, A. P.

    2004-07-01

    An expert system has been developed for the initial analysis of a recorded spectrum, namely, for the line search and the determination of line positions and intensities. The expert system is based on pattern recognition algorithms. Object recognition learning allows the system to achieve the needed flexibility and to automatically detect groups of overlapping lines whose profiles should be fitted together. Gauss, Lorentz, and Voigt profiles are used as model profiles to which spectral lines are fitted. The expert system was applied to processing of the Fourier transform spectrum of the D2O molecule in the region 3200-4200 cm-1, and it detected 4670 lines in a spectrum consisting of 439000 data points. Not a single experimentally observed line exceeding the noise level was missed.
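
    Fitting one line with a model profile, the final step above, can be sketched with scipy; the Gaussian case is shown on synthetic data (the Lorentz and Voigt cases are analogous), and all of the numbers are illustrative.

    ```python
    import numpy as np
    from scipy.optimize import curve_fit

    def gauss(nu, position, intensity, width):
        """Gaussian line profile; the expert system also supports Lorentz/Voigt."""
        return intensity * np.exp(-0.5 * ((nu - position) / width) ** 2)

    # Synthetic stand-in for a small window of the spectrum around one line
    nu = np.linspace(3500.0, 3500.5, 200)                      # wavenumber, cm-1
    observed = gauss(nu, 3500.23, 0.8, 0.02) + 0.01 * np.random.randn(nu.size)

    popt, _ = curve_fit(gauss, nu, observed, p0=[3500.2, 1.0, 0.05])
    position, intensity, width = popt    # line position and intensity, as in the text
    print(position, intensity)
    ```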

  19. Automatic recognition of light source from color negative films using sorting classification techniques

    Science.gov (United States)

    Sanger, Demas S.; Haneishi, Hideaki; Miyake, Yoichi

    1995-08-01

    This paper proposes a simple and automatic method for recognizing the light source from various color negative film brands by means of digital image processing. First, we stretch the image obtained from a negative based on standardized scaling factors, then extract the dominant color component among the red, green, and blue components of the stretched image. The dominant color component becomes the discriminator for the recognition. The experimental results verified that any one of the three techniques could recognize the light source from negatives of any single film brand and of all brands, with greater than 93.2% and 96.6% correct recognition, respectively. This method is significant for the automation of color quality control in color reproduction from color negative film in mass processing and printing machines.

  20. Automatic recognition of polychlorinated biphenyls in gas-chromatographic/mass spectrometric analysis

    International Nuclear Information System (INIS)

    A computer code for the automatic recognition of mass spectra of polychlorinated biphenyls (PCBs) has been developed and used as a specific PCB detector in gas-chromatographic/mass spectrometric analysis. The recognition is based on numerical features extracted from the mass spectrum. The code is in Fortran. The results of a classification are the so-called classification chromatograms for the particular groups of PCBs of equal chlorine number. The practical application has been tested on water and waste oil samples with PCBs added. The sensitivity is 0.5-1 ng for separate PCB components and 5-20 ng for technical PCB mixtures. 59 refs., 50 figs., 5 tabs. (qui)

  1. Automatic Modulation Recognition Using Wavelet Transform and Neural Networks in Wireless Systems

    Science.gov (United States)

    Hassan, K.; Dayoub, I.; Hamouda, W.; Berbineau, M.

    2010-12-01

    Modulation type is one of the most important characteristics used in signal waveform identification. In this paper, an algorithm for automatic digital modulation recognition is proposed. The proposed algorithm is verified using higher-order statistical moments (HOM) of the continuous wavelet transform (CWT) as a feature set. A multilayer feed-forward neural network trained with the resilient backpropagation learning algorithm is proposed as a classifier. The purpose is to discriminate among different M-ary shift keying modulation schemes and the modulation order without any a priori signal information. Pre-processing and feature subset selection using principal component analysis are used to reduce the network complexity and to improve the classifier's performance. The proposed algorithm is evaluated through the confusion matrix and the false recognition probability. The proposed classifier is shown to be capable of recognizing the modulation scheme with high accuracy over a wide signal-to-noise ratio (SNR) range over both additive white Gaussian noise (AWGN) and different fading channels.
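
    A minimal sketch of the feature set described, higher-order statistical moments of CWT coefficients, using PyWavelets and scipy; the scale range, wavelet choice, moment pooling and toy signal are assumptions rather than the paper's exact configuration.

    ```python
    import numpy as np
    import pywt
    from scipy.stats import kurtosis, skew

    def cwt_hom_features(signal, scales=np.arange(1, 33), wavelet="morl"):
        """Higher-order statistical moments (HOM) of the |CWT| coefficients,
        pooled over time at each scale -- the kind of feature set the paper
        feeds to a feed-forward neural network."""
        coeffs, _ = pywt.cwt(signal, scales, wavelet)
        mag = np.abs(coeffs)                       # (n_scales, n_samples)
        return np.concatenate([mag.mean(axis=1), mag.var(axis=1),
                               skew(mag, axis=1), kurtosis(mag, axis=1)])

    # Example: a crude random-phase stand-in for an M-ary shift keying signal
    x = np.exp(1j * np.pi / 2 * np.random.randint(0, 4, 2048)).real
    print(cwt_hom_features(x).shape)   # (128,) = 4 moments x 32 scales
    ```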

  2. Automatic Modulation Recognition Using Wavelet Transform and Neural Networks in Wireless Systems

    Directory of Open Access Journals (Sweden)

    Dayoub I

    2010-01-01

    Full Text Available Modulation type is one of the most important characteristics used in signal waveform identification. In this paper, an algorithm for automatic digital modulation recognition is proposed. The proposed algorithm is verified using higher-order statistical moments (HOM) of the continuous wavelet transform (CWT) as a feature set. A multilayer feed-forward neural network trained with the resilient backpropagation learning algorithm is proposed as a classifier. The purpose is to discriminate among different M-ary shift keying modulation schemes and the modulation order without any a priori signal information. Pre-processing and feature subset selection using principal component analysis are used to reduce the network complexity and to improve the classifier's performance. The proposed algorithm is evaluated through the confusion matrix and the false recognition probability. The proposed classifier is shown to be capable of recognizing the modulation scheme with high accuracy over a wide signal-to-noise ratio (SNR) range over both additive white Gaussian noise (AWGN) and different fading channels.

  3. Automatic Recognition of Sunspots in HSOS Full-Disk Solar Images

    Science.gov (United States)

    Zhao, Cui; Lin, GangHua; Deng, YuanYong; Yang, Xiao

    2016-05-01

    A procedure is introduced to recognise sunspots automatically in solar full-disk photosphere images obtained from the Huairou Solar Observing Station, National Astronomical Observatories of China. The images are first pre-processed with a Gaussian algorithm. Sunspots are then recognised by the morphological bottom-hat operation and Otsu thresholding. Wrong selections of sunspots are eliminated by a criterion on sunspot properties. In addition, in order to calculate the sunspot areas and the solar centre, the solar limb is extracted by a procedure using morphological closing and erosion operations and setting an adaptive threshold. The results of sunspot recognition reveal that the number of sunspots detected by our procedure is in quite good agreement with the manual method. The sunspot recognition rate is 95% and the error rate is 1.2%. The sunspot areas calculated by our method have a high correlation (95%) with the area data from the United States Air Force/National Oceanic and Atmospheric Administration (USAF/NOAA).

  4. Modern prescription theory and application: realistic expectations for speech recognition with hearing AIDS.

    Science.gov (United States)

    Johnson, Earl E

    2013-01-01

    A major decision at the time of hearing aid fitting and dispensing is the amount of amplification to provide listeners (both adult and pediatric populations) for the appropriate compensation of sensorineural hearing impairment across a range of frequencies (e.g., 160-10000 Hz) and input levels (e.g., 50-75 dB sound pressure level). This article describes modern prescription theory for hearing aids within the context of a risk versus return trade-off and efficient frontier analyses. The expected return of amplification recommendations (i.e., generic prescriptions such as National Acoustic Laboratories-Non-Linear 2, NAL-NL2, and Desired Sensation Level Multiple Input/Output, DSL m[i/o]) for the Speech Intelligibility Index (SII) and high-frequency audibility was traded against a potential risk (i.e., loudness). The modeled performance of each prescription was compared with the others and with the efficient frontier of normal hearing sensitivity (i.e., a reference point for the most return with the least risk). For the pediatric population, NAL-NL2 was more efficient for SII, while DSL m[i/o] was more efficient for high-frequency audibility. For the adult population, NAL-NL2 was more efficient for SII, while the two prescriptions were similar with regard to high-frequency audibility. In terms of absolute return (i.e., not considering the risk of loudness), however, DSL m[i/o] prescribed more outright high-frequency audibility than NAL-NL2 for either age group, particularly as hearing loss increased. Given the principles and demonstrated accuracy of desensitization (reduced utility of audibility with increasing hearing loss) observed at the group level, additional high-frequency audibility beyond that of NAL-NL2 is not expected to make further contributions to speech intelligibility (recognition) for the average listener. PMID:24253361

  5. Development and Evaluation of a Speech Recognition Test for Persian Speaking Adults

    Directory of Open Access Journals (Sweden)

    Mohammad Mosleh

    2001-05-01

    Full Text Available Method and Materials: This research was carried out to develop and evaluate 25 phonemically balanced word lists for Persian-speaking adults in two separate stages: development and evaluation. In the first stage, in order to balance the lists phonemically, the frequency of occurrence of each of the 29 phonemes (6 vowels and 23 consonants) of the Persian language in adult speech was determined. This stage showed significant differences between some phonemes' frequencies. Then, all Persian monosyllabic words were extracted from the Mo'in Persian dictionary. Semantically difficult words were rejected and the appropriate words chosen according to the judgment of 5 adult native speakers of Persian with high school diplomas. Twelve open-set 25-word lists were prepared. The lists were recorded on magnetic tape in an audio studio by a professional speaker of IRIB. In the second stage, in order to evaluate the test's validity and reliability, 60 normal-hearing adults (30 male, 30 female) were randomly selected and evaluated in test and retest sessions. Findings: 1. Normal-hearing adults obtained scores of 92-100 for each list at their MCL through test-retest. 2. No significant difference was observed (a) in test-retest scores in each list (P>0.05), (b) between the lists in test or retest scores (P>0.05), or (c) between sexes (P>0.05). Conclusion: The test is reliable and valid; the lists are phonemically balanced, equal in difficulty, and valuable for evaluating the speech recognition of Persian-speaking adults.

  6. Monitoring caustic injuries from emergency department databases using automatic keyword recognition software

    Science.gov (United States)

    Vignally, P.; Fondi, G.; Taggi, F.; Pitidis, A.; National Injury Database and National Information System on Accidents in the Home Surveillance Groups

    2011-01-01

    Summary In Italy the European Union Injury Database reports the involvement of chemical products in 0.9% of home and leisure accidents. The Emergency Department registry on domestic accidents in Italy and the Poison Control Centres record that 90% of cases of exposure to toxic substances occur in the home. It is not rare for the effects of chemical agents to be observed in hospitals, with a high potential risk of damage - the rate of this cause of hospital admission is double the domestic injury average. The aim of this study was to monitor the effects of injuries caused by caustic agents in Italy using automatic free-text recognition in Emergency Department medical databases. We created a Stata software program to automatically identify caustic or corrosive injury cases using an agent-specific list of keywords. We focused attention on the procedure's sensitivity and specificity. Ten hospitals in six regions of Italy participated in the study. The program identified 112 cases of injury by caustic or corrosive agents. Checking the cases by quality controls (based on manual reading of ED reports), we assessed 99 cases as true positive, i.e. 88.4% of the patients were automatically recognized by the software as being affected by caustic substances (99% CI: 80.6%-96.2%), that is to say 0.59% (99% CI: 0.45%-0.76%) of the whole sample of home injuries, a value almost three times as high as that expected (p < 0.0001) from European codified information. False positives were 11.6% of the recognized cases (99% CI: 5.1%-21.5%). Our automatic procedure for caustic agent identification proved to have excellent product recognition capacity with an acceptable level of excess sensitivity. Contrary to our a priori hypothesis, the automatic recognition system provided a level of identification of agents possessing caustic effects that was significantly greater than was predictable on the basis of the values from current codifications reported in the European Database. PMID
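
    The study's identification step was a purpose-built Stata program; the sketch below re-creates the core keyword-matching idea in Python under assumptions. The keyword list is illustrative only and, as in the study, flagged cases would still go to manual quality control.

```python
# Hypothetical keyword-based case flagging over ED free-text reports.
import re

KEYWORDS = ["caustic soda", "lye", "sodium hydroxide", "bleach",
            "drain cleaner", "ammonia", "sulfuric acid", "corrosive"]
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def flag_caustic(report_text: str) -> bool:
    """Return True when the free text contains a caustic-agent keyword."""
    return PATTERN.search(report_text) is not None

reports = [
    "Child ingested drain cleaner, oral burns noted.",
    "Fell from ladder, left wrist fracture.",
]
flagged = [r for r in reports if flag_caustic(r)]
print(len(flagged), "case(s) flagged for manual review")
```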

  7. Automatic Human Gait Imitation and Recognition in 3D from Monocular Video with an Uncalibrated Camera

    Directory of Open Access Journals (Sweden)

    Tao Yu

    2012-01-01

    Full Text Available A framework for imitating real human gait in 3D from monocular video of an uncalibrated camera, directly and automatically, is proposed. It first combines polygon approximation with deformable template matching, using knowledge of human anatomy, to obtain the characteristics of real human gait, including static and dynamic parameters. Then, these characteristics are regularized and normalized. Finally, they are imposed on a 3D human motion model with prior constraints and universal gait knowledge to realize the imitation of human gait. In recognition based on this human gait imitation, the dimensionality of the time sequences corresponding to motion curves is first reduced by NPE. Then, we use the essential features acquired from human gait imitation as input and integrate HCRF with SVM as a whole classifier, realizing identification recognition of human gait. In the associated experiments, this imitation framework is robust to the subject's clothing and backpacks to a certain extent. It needs neither manual assistance nor any camera model information. It is suited to straight indoor walking with a viewing angle to the target between 60° and 120°. In recognition testing, this integrated HCRF/SVM classifier achieves a comparatively higher recognition rate than HCRF alone, SVM alone, and a typical baseline method.

  8. Language and Speech Processing

    CERN Document Server

    Mariani, Joseph

    2008-01-01

    Speech processing addresses various scientific and technological areas. It includes speech analysis and variable rate coding, in order to store or transmit speech. It also covers speech synthesis, especially from text, speech recognition, including speaker and language identification, and spoken language understanding. This book covers the following topics: how to realize speech production and perception systems, how to synthesize and understand speech using state-of-the-art methods in signal processing, pattern recognition, stochastic modelling, computational linguistics and human factor studies.

  9. An Agent-based Framework for Speech Investigation

    OpenAIRE

    Walsh, Michael; O'Hare, G.M.P.; Carson-Berndsen, Julie

    2005-01-01

    This paper presents a novel agent-based framework for investigating speech recognition which combines statistical data and explicit phonological knowledge in order to explore strategies aimed at augmenting the performance of automatic speech recognition (ASR) systems. This line of research is motivated by a desire to provide solutions to some of the more notable problems encountered, including in particular the problematic phenomena of coarticulation, underspecified input...

  10. Noise Estimation and Noise Removal Techniques for Speech Recognition in Adverse Environment

    OpenAIRE

    Shrawankar, Urmila; Thakare, Vilas

    2010-01-01

    Noise is ubiquitous in almost all acoustic environments. The speech signal that is recorded by a microphone is generally contaminated by noise originating from various sources. Such contamination can change the characteristics of the speech signals and degrade the speech quality and intelligibility, thereby causing significant harm to human-to-machine communication systems. Noise detection and reduction for speech applications is often formulated as a digital filtering problem, where the clean speech...

  11. Managing predefined templates and macros for a departmental speech recognition system using common software.

    Science.gov (United States)

    Sistrom, C L; Honeyman, J C; Mancuso, A; Quisling, R G

    2001-09-01

    The authors have developed a networked database system to create, store, and manage predefined radiology report definitions. This was prompted by complete departmental conversion to a computer speech recognition system (SRS) for clinical reporting. The software complements and extends the capabilities of the SRS, and the two systems are integrated by means of a simple text file format and import/export functions within each program. This report describes the functional requirements, design considerations, and implementation details of the structured report management software. The database and its interface are designed to allow all radiologists and division managers to define and update template structures relevant to their practice areas. Two key conceptual extensions supported by the template management system are the addition of a template type construct and allowing individual radiologists to dynamically share common organ system or modality-specific templates. In addition, the template manager software enables specifying predefined report structures that can be triggered at the time of dictation from printed lists of barcodes. Initial experience using the program in a regional, multisite, academic radiology practice has been positive. PMID:11720335

  12. Development of a two wheeled self balancing robot with speech recognition and navigation algorithm

    Science.gov (United States)

    Rahman, Md. Muhaimin; Ashik-E-Rasul; Haq, Nowab Md. Aminul; Hassan, Mehedi; Hasib, Irfan Mohammad Al; Hassan, K. M. Rafidh

    2016-07-01

    This paper discusses the modeling, construction and development of the navigation algorithm of a two-wheeled self-balancing mobile robot in an enclosure. We discuss the design of the two main controller algorithms, namely the PID algorithms, on the robot model. Simulation is performed in the SIMULINK environment. The controller is developed primarily for self-balancing of the robot and also for its positioning. As for the navigation in an enclosure, a template matching algorithm is proposed for precise measurement of the robot position. The navigation system needs to be calibrated before the navigation process starts. Almost all of the earlier template matching algorithms that can be found in the open literature can only trace the robot. But the proposed algorithm here can also locate the position of other objects in an enclosure, like furniture, tables etc. This will enable the robot to know the exact location of every stationary object in the enclosure. Moreover, some additional features, such as speech recognition and object detection, are added. For object detection, the single-board computer Raspberry Pi is used. The system is programmed to analyze images captured via the camera, which are then processed through background subtraction, followed by active noise reduction.
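
    A minimal discrete PID loop of the kind tuned in SIMULINK for the balancing task; the gains and the one-state inverted-pendulum "plant" below are illustrative stand-ins, not the paper's model.

```python
# Toy PID balancing loop; gains and plant dynamics are illustrative only.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

pid = PID(kp=20.0, ki=1.0, kd=2.0, dt=0.01)
angle, rate = 0.1, 0.0                # start leaning ~5.7 degrees off upright
for _ in range(500):
    torque = pid.update(0.0, angle)   # drive the tilt angle back to zero
    rate += (9.81 * angle + torque) * 0.01  # gravity destabilizes, PID corrects
    angle += rate * 0.01
print(f"final angle: {angle:.4f} rad")
```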

  13. Automatic recognition of cardiac arrhythmias based on the geometric patterns of Poincaré plots

    International Nuclear Information System (INIS)

    The Poincaré plot emerges as an effective tool for assessing cardiovascular autonomic regulation. It displays nonlinear characteristics of heart rate variability (HRV) from electrocardiographic (ECG) recordings and gives a global view of the long range of ECG signals. In a telemedicine or computer-aided diagnosis system, it would offer significant auxiliary information for diagnosis if the patterns of the Poincaré plots could be automatically classified. Therefore, we developed an automatic classification system to distinguish five geometric patterns of the Poincaré plots from four types of cardiac arrhythmias. The statistical features are designed on measurements and an ensemble classifier of three types of neural networks is proposed. Aiming at the difficulty of setting a proper threshold for classifying the multiple categories, the threshold selection strategy is analyzed. 24 h ECG monitoring recordings from 674 patients, covering four types of cardiac arrhythmias, are adopted for recognition. For comparison, Support Vector Machine (SVM) classifiers with linear and Gaussian kernels are also applied. The experimental results demonstrate the effectiveness of the extracted features and the better performance of the designed classifier. Our study can be applied to diagnose the corresponding sinus rhythm and arrhythmia substrate diseases automatically in telemedicine and computer-aided diagnosis systems. (paper)

  14. Automatic recognition of cardiac arrhythmias based on the geometric patterns of Poincaré plots.

    Science.gov (United States)

    Zhang, Lijuan; Guo, Tianci; Xi, Bin; Fan, Yang; Wang, Kun; Bi, Jiacheng; Wang, Ying

    2015-02-01

    The Poincaré plot emerges as an effective tool for assessing cardiovascular autonomic regulation. It displays nonlinear characteristics of heart rate variability (HRV) from electrocardiographic (ECG) recordings and gives a global view of the long range of ECG signals. In a telemedicine or computer-aided diagnosis system, it would offer significant auxiliary information for diagnosis if the patterns of the Poincaré plots could be automatically classified. Therefore, we developed an automatic classification system to distinguish five geometric patterns of the Poincaré plots from four types of cardiac arrhythmias. The statistical features are designed on measurements and an ensemble classifier of three types of neural networks is proposed. Aiming at the difficulty of setting a proper threshold for classifying the multiple categories, the threshold selection strategy is analyzed. 24 h ECG monitoring recordings from 674 patients, covering four types of cardiac arrhythmias, are adopted for recognition. For comparison, Support Vector Machine (SVM) classifiers with linear and Gaussian kernels are also applied. The experimental results demonstrate the effectiveness of the extracted features and the better performance of the designed classifier. Our study can be applied to diagnose the corresponding sinus rhythm and arrhythmia substrate diseases automatically in telemedicine and computer-aided diagnosis systems. PMID:25582837
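
    The exact feature set designed on the plot measurements is not spelled out in the abstract; as a hypothetical illustration, the sketch below computes the standard Poincaré descriptors SD1 and SD2 from an RR-interval series, the kind of geometric measurement such classifiers are built on.

```python
# Standard Poincaré plot descriptors from RR intervals; data below is a toy series.
import numpy as np

def poincare_sd(rr_ms):
    """SD1 (short-term) and SD2 (long-term) dispersion of the Poincaré plot."""
    rr = np.asarray(rr_ms, dtype=float)
    x, y = rr[:-1], rr[1:]                      # (RR_n, RR_{n+1}) pairs
    sd1 = np.std((y - x) / np.sqrt(2), ddof=1)  # spread across the identity line
    sd2 = np.std((y + x) / np.sqrt(2), ddof=1)  # spread along the identity line
    return sd1, sd2

rr = [812, 790, 803, 828, 795, 810, 789, 802]   # toy RR intervals in ms
sd1, sd2 = poincare_sd(rr)
print(f"SD1={sd1:.1f} ms, SD2={sd2:.1f} ms, ratio={sd1/sd2:.2f}")
```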

  15. Automatic anatomy recognition in post-tonsillectomy MR images of obese children with OSAS

    Science.gov (United States)

    Tong, Yubing; Udupa, Jayaram K.; Odhner, Dewey; Sin, Sanghun; Arens, Raanan

    2015-03-01

    Automatic Anatomy Recognition (AAR) is a recently developed approach for automatic body-wide organ segmentation. We previously tested that methodology on image cases with some pathology where the organs were not distorted significantly. In this paper, we present an advancement of AAR to handle organs which may have been modified or resected by surgical intervention. We focus on MRI of the neck in pediatric Obstructive Sleep Apnea Syndrome (OSAS). The proposed method consists of an AAR step followed by support vector machine techniques to detect the presence/absence of organs. The AAR step employs a hierarchical organization of the organs for model building. For each organ, a fuzzy model over a population is built. The model of the body region is then described in terms of the fuzzy models and a host of other descriptors, which include the parent-to-offspring relationship estimated over the population. Organs are recognized following the organ hierarchy by using an optimal threshold-based search. The SVM step subsequently checks for evidence of the presence of organs. Experimental results show that AAR techniques can be combined with machine learning strategies within the AAR recognition framework for good performance in recognizing missing organs, in our case missing tonsils in post-tonsillectomy images as well as in simulated tonsillectomy images. The previous recognition performance is maintained, achieving an organ localization accuracy of within 1 voxel when the organ is actually not removed. To our knowledge, no methods have been reported to date for handling significantly deformed or missing organs, especially in neck MRI.

  16. Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems

    Directory of Open Access Journals (Sweden)

    Kiruthiga S

    2012-01-01

    Full Text Available A spoken language system, whether it is a speech synthesis or a speech recognition system, starts with building a speech corpus. We give a detailed survey of issues and a methodology for selecting the appropriate speech unit when building a speech corpus for Indian language Text to Speech systems. The paper ultimately aims to improve the intelligibility of the synthesized speech in Text to Speech synthesis systems. To begin with, an appropriate text file should be selected for building the speech corpus. Then a corresponding speech file is generated and stored. This speech file is the phonetic representation of the selected text file. The speech file is processed at different levels, viz., paragraphs, sentences, phrases, words, syllables and phones. These are called the speech units of the file. Research has been done taking each of these units as the basic unit for processing. This paper analyses the research done using phones, diphones, triphones, syllables and polysyllables as the basic unit for speech synthesis. The paper also provides a recommended set of combinations for polysyllables. Concatenative speech synthesis involves the concatenation of these basic units to synthesize intelligible, natural-sounding speech. The speech units are annotated with relevant prosodic information about each unit, manually or automatically, based on an algorithm. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique is used on the annotated speech corpus that provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit.
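
    The final selection step, picking the candidate unit sequence with the lowest total join cost, is a shortest-path problem. Below is a toy dynamic-programming sketch under assumptions: each target unit has a few candidate recordings, units are reduced to scalar spectral anchors, and the join cost is simply the mismatch at the boundary.

```python
# Toy Viterbi-style unit selection by minimum total join cost.
import numpy as np

def select_units(candidates, join_cost):
    """candidates: list (per target unit) of candidate units.
    Returns the candidate sequence with the lowest total join cost."""
    n = len(candidates)
    best = [np.zeros(len(candidates[0]))]           # cost so far per candidate
    back = []
    for t in range(1, n):
        costs = np.array([[best[-1][i] + join_cost(a, b)
                           for i, a in enumerate(candidates[t - 1])]
                          for b in candidates[t]])
        back.append(costs.argmin(axis=1))           # best predecessor per candidate
        best.append(costs.min(axis=1))
    # Trace back the cheapest path.
    idx = [int(np.argmin(best[-1]))]
    for bp in reversed(back):
        idx.append(int(bp[idx[-1]]))
    return [c[i] for c, i in zip(candidates, reversed(idx))]

# Toy "units": scalar spectral anchors; join cost = mismatch at the boundary.
cands = [[1.0, 1.4], [1.1, 2.0], [1.2, 0.9]]
print(select_units(cands, join_cost=lambda a, b: abs(a - b)))  # [1.0, 1.1, 1.2]
```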

  17. Combining Statistical Parametric Speech Synthesis and Unit-Selection for Automatic Voice Cloning

    OpenAIRE

    Aylett, Matthew; Yamagishi, Junichi

    2008-01-01

    The ability to use the recorded audio of a subject's voice to produce an open-domain synthesis system has generated much interest both in academic research and in commercial speech technology. The ability to produce synthetic versions of a subject's voice has potential commercial applications, such as virtual celebrity actors, or potential clinical applications, such as offering a synthetic replacement voice in the case of a laryngectomy. Recent developments in HMM-based speech synthesis have ...

  18. Parallel System Architecture (PSA): An efficient approach for automatic recognition of volcano-seismic events

    Science.gov (United States)

    Cortés, Guillermo; García, Luz; Álvarez, Isaac; Benítez, Carmen; de la Torre, Ángel; Ibáñez, Jesús

    2014-02-01

    Automatic recognition of volcano-seismic events is becoming one of the most demanded features in the early warning area at continuous monitoring facilities. While human-driven cataloguing is a time-consuming and often unreliable task, an appropriate machine framework allows expert technicians to focus only on result analysis and decision-making. This work presents an alternative to the serial architectures used in classic recognition systems, introducing a parallel implementation of the whole process: configuration, feature extraction, feature selection and classification stages are independently carried out for each type of event in order to exploit the intrinsic properties of each signal class. The system uses Gaussian Mixture Models (GMMs) to classify the database recorded at Deception Volcano Island (Antarctica), obtaining a baseline recognition rate of 84% with a cepstral-based waveform parameterization in the serial architecture. The parallel approach increases the results to close to 92% using mixture-based parameterization vectors, or up to 91% when the vector size is reduced by 19% via the Discriminative Feature Selection (DFS) algorithm. Besides the result improvement, the parallel architecture represents a major step in terms of flexibility and reliability thanks to the class-focused analysis, providing an efficient tool for monitoring observatories which require real-time solutions.
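
    A compact stand-in for the class-wise GMM scheme, assuming scikit-learn and synthetic feature vectors: one GaussianMixture per event class (the parallel branches), with classification by the highest per-model log-likelihood. The class names and feature dimensions are illustrative only.

```python
# Per-class GMMs with maximum-likelihood classification; data is synthetic.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
classes = {"LP": rng.normal(0.0, 1.0, (200, 12)),      # long-period events
           "VT": rng.normal(2.0, 1.5, (200, 12)),      # volcano-tectonic events
           "TRE": rng.normal(-2.0, 0.8, (200, 12))}    # tremor

models = {name: GaussianMixture(n_components=4, random_state=0).fit(feats)
          for name, feats in classes.items()}

def classify(x):
    """Pick the class whose model assigns the highest log-likelihood."""
    scores = {name: m.score(x.reshape(1, -1)) for name, m in models.items()}
    return max(scores, key=scores.get)

print(classify(rng.normal(2.0, 1.5, 12)))   # expected: "VT"
```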

  19. Speech recognition in reverberant and noisy environments employing multiple feature extractors and i-vector speaker adaptation

    Science.gov (United States)

    Alam, Md Jahangir; Gupta, Vishwa; Kenny, Patrick; Dumouchel, Pierre

    2015-12-01

    The REVERB challenge provides a common framework for the evaluation of feature extraction techniques in the presence of both reverberation and additive background noise. State-of-the-art speech recognition systems perform well in controlled environments, but their performance degrades in realistic acoustical conditions, especially in real as well as simulated reverberant environments. In this contribution, we utilize multiple feature extractors including the conventional mel-filterbank, multi-taper spectrum estimation-based mel-filterbank, robust mel and compressive gammachirp filterbank, iterative deconvolution-based dereverberated mel-filterbank, and maximum likelihood inverse filtering-based dereverberated mel-frequency cepstral coefficient features for speech recognition with multi-condition training data. In order to improve speech recognition performance, we combine their results using ROVER (Recognizer Output Voting Error Reduction). For the two- and eight-channel tasks, to benefit from the multi-channel data, we also use ROVER, instead of a multi-microphone signal processing method, to reduce the word error rate by selecting the best scoring word at each channel. As in previous work, we also apply i-vector-based speaker adaptation, which was found to be effective. In the speech recognition task, speaker adaptation tries to reduce the mismatch between the training and test speakers. Speech recognition experiments are conducted on the REVERB challenge 2014 corpora using the Kaldi recognizer. In our experiments, we use both utterance-based batch processing and full batch processing. In the single-channel task, full batch processing reduced the word error rate (WER) from 10.0 to 9.3 % on SimData as compared to utterance-based batch processing. Using full batch processing, we obtained an average WER of 9.0 and 23.4 % on the SimData and RealData, respectively, for the two-channel task, whereas for the eight-channel task on the SimData and RealData, the average WERs found were 8
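
    ROVER proper first aligns the recognizers' outputs into a word transition network before voting; the sketch below skips the alignment and assumes pre-aligned hypotheses of equal length (with "-" marking deletions), keeping only the majority-vote step as an illustration.

```python
# Simplified ROVER-style voting over pre-aligned recognizer hypotheses.
from collections import Counter

def rover_vote(hypotheses):
    """hypotheses: list of aligned word lists from different recognizers."""
    voted = []
    for slot in zip(*hypotheses):
        word, _ = Counter(slot).most_common(1)[0]   # majority word per slot
        if word != "-":                             # drop voted deletions
            voted.append(word)
    return voted

hyps = [["the", "cat", "sat", "-"],
        ["the", "cap", "sat", "down"],
        ["the", "cat", "sat", "down"]]
print(" ".join(rover_vote(hyps)))   # -> "the cat sat down"
```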

  20. Semantic and Phonetic Automatic Reconstruction of Medical Dictations

    OpenAIRE

    Petrik, Stefan; Drexel, Christina; Fessler, Leo; Jancsary, Jeremy; Klein, Alexandra; Kubin, Gernot; Matiasek, Johannes; Pernkopf, Franz; Trost, Harald

    2010-01-01

    Abstract Automatic speech recognition (ASR) has become a valuable tool in large document production environments like medical dictation. While manual post-processing is still needed for correcting speech-recognition errors and for creating documents which adhere to various stylistic and formatting conventions, a large part of the document production process is carried out by the ASR system. For improving the quality of the system output, knowledge about the multi-layered relationsh...

  1. Development of Portable Automatic Number Plate Recognition System on Android Mobile Phone

    Science.gov (United States)

    Mutholib, Abdul; Gunawan, Teddy S.; Chebil, Jalel; Kartiwi, Mira

    2013-12-01

    The Automatic Number Plate Recognition (ANPR) system plays a main role in various access control and security applications, such as tracking of stolen vehicles, traffic violations (speed traps) and parking management systems. In this paper, a portable ANPR implemented on an Android mobile phone is presented. The main challenges in a mobile application include higher coding efficiency, reduced computational complexity, and improved flexibility. Significant efforts are being made to find a suitable and adaptive algorithm for the implementation of ANPR on a mobile phone. An ANPR system for a mobile phone needs to be optimized due to the phone's limited CPU and memory resources, its ability to geo-tag captured images using GPS coordinates, and its ability to access an online database to store the vehicle's information. In this paper, the design of a portable ANPR on an Android mobile phone is described as follows. First, the graphical user interface (GUI) for capturing images using the built-in camera was developed to acquire vehicle plate numbers in Malaysia. Second, the pre-processing of the raw image was done using contrast enhancement. Next, character segmentation using fixed pitch and optical character recognition (OCR) using a neural network were utilized to extract texts and numbers. Both character segmentation and OCR used the Tesseract library from Google Inc. The proposed portable ANPR algorithm was implemented and simulated using the Android SDK on a computer. Based on the experimental results, the proposed system can effectively recognize the license plate number at 90.86%. The required processing time to recognize a license plate is only 2 seconds on average. The result is considered good in comparison with the results obtained from previous systems processed on a desktop PC, with results ranging from 91.59% to 98% recognition rate and 0.284 second to 1.5 seconds recognition time.
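
    A desktop-Python approximation of the recognition steps (the paper runs Tesseract through the Android SDK): contrast enhancement, binarization, then OCR restricted to plate characters. The file name and character whitelist are placeholders, and pytesseract is assumed as the Tesseract binding.

```python
# Illustrative plate-reading pipeline with OpenCV and pytesseract.
import cv2
import pytesseract

image = cv2.imread("plate.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
enhanced = cv2.equalizeHist(image)                      # contrast enhancement
_, binary = cv2.threshold(enhanced, 0, 255,
                          cv2.THRESH_BINARY + cv2.THRESH_OTSU)
text = pytesseract.image_to_string(
    binary,
    config="--psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")
print(text.strip())                                     # e.g. "WXY1234"
```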

  2. Using Fuzzy Modifier in Similarity Measure of Fuzzy Attribute Graph and Its Automatic Selection in Structural Pattern Recognition

    OpenAIRE

    Payman Moallem

    2007-01-01

    Fuzzy Attribute Graph (FAG) is a powerful tool for the representation and recognition of structural patterns. The conventional framework for the similarity measure of FAGs is based on equivalent fuzzy attributes, but in the fuzzy world, some attributes are more important than others. In this paper, a modified recognition framework, using a linguistic modifier for matching fuzzy attribute graphs, is introduced. Then an algorithm for automatic selection of the fuzzy modifier based on the learning patterns is proposed. So...

  3. Speech Recognition Performance in Children with Cochlear Implants Using Bimodal Stimulation

    OpenAIRE

    Rathna Kumar, S. B.; Mohanty, P.; Prakash, S. G. R.

    2010-01-01

    Cochlear implantees have considerably good speech understanding abilities in quiet surroundings. But, ambient noise poses significant difficulties in understanding speech for these individuals. Bimodal stimulation is still not used by many Indian implantees in spite of reports that bimodal stimulation is beneficial for speech understanding in noise as compared to cochlear implant alone and also prevents auditory deprivation in the un-implanted ear. The aim of the study is to evaluate the bene...

  4. Comparison of the South African Spondaic and CID W-1 wordlists for measuring speech recognition threshold

    Directory of Open Access Journals (Sweden)

    Tanya Hanekom

    2015-02-01

    Full Text Available Background: The home language of most audiologists in South Africa is either English or Afrikaans, whereas most South Africans speak an African language as their home language. An English wordlist, the South African Spondaic (SAS) wordlist, familiar to the English Second Language (ESL) population, was developed by the author for testing the speech recognition threshold (SRT) of ESL speakers. Objectives: The aim of this study was to compare the pure-tone average (PTA)/SRT correlation results of ESL participants when using the SAS wordlist (list A) and the CID W-1 spondaic wordlist (list B – less familiar; list C – more familiar CID W-1 words). Method: A mixed-group correlational, quantitative design was adopted. PTA and SRT measurements were compared for lists A, B and C for 101 (197 ears) ESL participants with normal hearing or a minimal hearing loss (<26 dBHL; mean age 33.3). Results: The Pearson correlation analysis revealed a strong PTA/SRT correlation when using list A (right 0.65; left 0.58) and list C (right 0.63; left 0.56). The use of list B revealed weak correlations (right 0.30; left 0.32). Paired sample t-tests indicated a statistically significantly stronger PTA/SRT correlation when list A was used, rather than list B or list C, at a 95% level of confidence. Conclusions: The use of the SAS wordlist yielded a stronger PTA/SRT correlation than the use of the CID W-1 wordlist when performing SRT testing on South African ESL speakers with normal hearing or minimal hearing loss (<26 dBHL).

  5. Speech recognition software and electronic psychiatric progress notes: physicians' ratings and preferences

    Directory of Open Access Journals (Sweden)

    Derman Yaron D

    2010-08-01

    Full Text Available Abstract Background: The context of the current study was the mandatory adoption of electronic clinical documentation within a large mental health care organization. Psychiatric electronic documentation has unique needs owing to the nature of its dense narrative content. Our goal was to determine if speech recognition (SR) would ease the creation of electronic progress note (ePN) documents by physicians at our institution. Methods: Subjects: twelve physicians had access to SR software on their computers for a period of four weeks to create ePN. Measurements: we examined SR software in relation to its perceived usability, data entry time savings, impact on the quality of care and quality of documentation, and the impact on clinical and administrative workflow, as compared to existing methods for data entry. Data analysis: a series of Wilcoxon signed rank tests were used to compare pre- and post-SR measures; a qualitative study design was used. Results: Six of twelve participants completing the study favoured the use of SR (five with SR alone plus one with SR via hand-held digital recorder) for creating electronic progress notes over their existing mode of data entry. There was no clear perceived benefit from SR in terms of data entry time savings, quality of care, quality of documentation, or impact on clinical and administrative workflow. Conclusions: Although our findings are mixed, SR may be a technology with some promise for mental health documentation. Future investigations of this nature should use more participants, a broader range of document types, and compare front- and back-end SR methods.

  6. Semi-automatic parking slot marking recognition for intelligent parking assist systems

    Directory of Open Access Journals (Sweden)

    Ho Gi Jung

    2014-01-01

    Full Text Available This paper proposes a semi-automatic parking slot marking-based target position designation method for parking assist systems in cases where the parking slot markings are of a rectangular type, and its efficient implementation for real-time operation. After the driver observes a rearview image captured by a rearward camera installed at the rear of the vehicle through a touchscreen-based human machine interface, a target parking position is designated by touching the inside of a parking slot. To ensure the proposed method operates in real time in an embedded environment, access to the bird's-eye view image is made efficient: image-wise batch transformation is replaced with pixel-wise instantaneous transformation. The proposed method showed a 95.5% recognition rate in 378 test cases with 63 test images. Additionally, experiments confirmed that the pixel-wise instantaneous transformation reduced execution time by 92%.
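
    The efficiency claim is that only the touched pixel needs to be mapped to the ground plane, instead of warping every frame to a bird's-eye view. Below is a sketch of that single-point mapping, with an illustrative 3x3 homography H standing in for the calibrated camera-to-ground transform.

```python
# Pixel-wise instantaneous transformation: map one touched pixel, not the frame.
import numpy as np

H = np.array([[1.2, 0.1, -30.0],      # placeholder camera-to-ground homography
              [0.0, 1.5, -80.0],
              [0.0, 0.002, 1.0]])

def to_birds_eye(u, v, H):
    """Map a single image pixel (u, v) to ground-plane coordinates."""
    x, y, w = H @ np.array([u, v, 1.0])
    return x / w, y / w                # perspective division

# One touch event -> one transform, instead of one warp per full frame.
print(to_birds_eye(320.0, 240.0, H))
```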

  7. Index of Garbledness for Automatic Recognition of Plain English Texts (Short Communication)

    Directory of Open Access Journals (Sweden)

    P.K. Saxena

    2010-07-01

    Full Text Available In this paper, an Index of Garbledness (IG) has been defined for automatic recognition of plain English texts, based on linguistic characteristics of the English language, without using a dictionary. It also works for continuous text without word break-up (text without blank spaces between words). These characteristics, being vague in nature, are suitably represented through fuzzy sets. A fuzzy similarity relation and a fuzzy dissimilarity measure have been used to define this Index. Based on a threshold value of the Index, one can test whether a given text (continuous, without word break-up) is a plain English text or not. In case the text under consideration is not a plain text, it also gives an indication of the extent to which it is garbled. Defence Science Journal, 2010, 60(4), pp. 415-419, DOI: http://dx.doi.org/10.14429/dsj.60.501

  8. Automatic facial feature extraction and expression recognition based on neural network

    CERN Document Server

    Khandait, S P; Khandait, P D

    2012-01-01

    In this paper, an approach to the problem of automatic facial feature extraction from a still frontal posed image, and the classification and recognition of facial expression and hence the emotion and mood of a person, is presented. A feed-forward back-propagation neural network is used as a classifier for classifying the expression of a supplied face into seven basic categories: surprise, neutral, sad, disgust, fear, happy and angry. For face portion segmentation and localization, morphological image processing operations are used. Permanent facial features like eyebrows, eyes, mouth and nose are extracted using the SUSAN edge detection operator, facial geometry, and edge projection analysis. Experiments carried out on the JAFFE facial expression database give good performance: 100% accuracy on the training set and 95.26% accuracy on the test set.
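
    A compact stand-in for the classification stage, assuming scikit-learn: geometric feature vectors go into a feed-forward network (sigmoid activations echo the back-propagation setup). The features and labels below are synthetic placeholders, since JAFFE itself is not bundled.

```python
# Toy expression classifier; features and labels are synthetic placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["surprise", "neutral", "sad", "disgust", "fear", "happy", "angry"]
rng = np.random.default_rng(2)
# Pretend geometry features: eyebrow height, eye openness, mouth width, etc.
X = rng.random((210, 6))
y = [EMOTIONS[i] for i in np.argmax(X, axis=1)]   # learnable synthetic labels

net = MLPClassifier(hidden_layer_sizes=(24,), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(X[:3]))
```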

  9. Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

    Institute of Scientific and Technical Information of China (English)

    Jie ZHOU; Bi-cheng LI; Gang CHEN

    2015-01-01

    Named entity recognition (NER) is a core component in many natural language processing applications. Most NER systems rely on supervised machine learning methods, which depend on time-consuming and expensive annotations in different languages and domains. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel and language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors caused by entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier, and a combined method is used to improve the precision and coverage. Then, we realize type identification of implicit mention by using boundary information of outgoing links. By selecting the sentences related with the domains of test data, we can train better NER models. In the experiments, large-scale NER corpora containing 2.3 million sentences are built from Chinese Wikipedia. The results show the effectiveness of automatically annotated corpora, and the trained NER models achieve the best performance when combining our silver-standard corpora with gold-standard corpora.

  10. Model-based vision system for automatic recognition of structures in dental radiographs

    Science.gov (United States)

    Acharya, Raj S.; Samarabandu, Jagath K.; Hausmann, E.; Allen, K. A.

    1991-07-01

    X-ray diagnosis of destructive periodontal disease requires assessing serial radiographs by an expert to determine the change in the distance between the cemento-enamel junction (CEJ) and the bone crest. To achieve this without the subjectivity of a human expert, a knowledge based system is proposed to automatically locate the two landmarks, which are the CEJ and the level of the alveolar crest at its junction with the periodontal ligament space. This work is a part of an ongoing project to automatically measure the distance between the CEJ and the bone crest along a line parallel to the axis of the tooth. The approach presented in this paper is based on identifying a prominent feature such as the tooth boundary using local edge detection and edge thresholding to establish a reference, and then using model knowledge to process sub-regions in locating the landmarks. The segmentation techniques invoked around these regions consist of a neural-network-like hierarchical refinement scheme together with local gradient extraction, multilevel thresholding and ridge tracking. Recognition accuracy is further improved by first locating the easily identifiable parts of the bone surface and the interface between the enamel and the dentine, and then extending these boundaries towards the periodontal ligament space and the tooth boundary respectively. The system is realized as a collection of tools (or knowledge sources) for pre-processing, segmentation, primary and secondary feature detection, and a control structure based on the blackboard model to coordinate the activities of these tools.

  11. An Efficient Multimodal 2D + 3D Feature-based Approach to Automatic Facial Expression Recognition

    KAUST Repository

    Li, Huibin

    2015-07-29

    We present a fully automatic multimodal 2D + 3D feature-based facial expression recognition approach and demonstrate its performance on the BU-3DFE database. Our approach combines multi-order gradient-based local texture and shape descriptors in order to achieve efficiency and robustness. First, a large set of fiducial facial landmarks of 2D face images along with their 3D face scans are localized using a novel algorithm namely incremental Parallel Cascade of Linear Regression (iPar-CLR). Then, a novel Histogram of Second Order Gradients (HSOG) based local image descriptor in conjunction with the widely used first-order gradient based SIFT descriptor are used to describe the local texture around each 2D landmark. Similarly, the local geometry around each 3D landmark is described by two novel local shape descriptors constructed using the first-order and the second-order surface differential geometry quantities, i.e., Histogram of mesh Gradients (meshHOG) and Histogram of mesh Shape index (curvature quantization, meshHOS). Finally, the Support Vector Machine (SVM) based recognition results of all 2D and 3D descriptors are fused at both feature-level and score-level to further improve the accuracy. Comprehensive experimental results demonstrate that there exist impressive complementary characteristics between the 2D and 3D descriptors. We use the BU-3DFE benchmark to compare our approach to the state-of-the-art ones. Our multimodal feature-based approach outperforms the others by achieving an average recognition accuracy of 86.32%. Moreover, a good generalization ability is shown on the Bosphorus database.

  12. Silent Speech Interfaces

    OpenAIRE

    Denby, B; Schultz, T.; Honda, K.; Hueber, T.; Gilbert, J.M.; Brumberg, J.S.

    2010-01-01

    Abstract The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a 'silent speech' interface, to be used as an aid for the speech handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telec...

  13. Reverberant speech recognition combining deep neural networks and deep autoencoders augmented with a phone-class feature

    Science.gov (United States)

    Mimura, Masato; Sakai, Shinsuke; Kawahara, Tatsuya

    2015-12-01

    We propose an approach to reverberant speech recognition adopting deep learning in the front-end as well as back-end of a reverberant speech recognition system, and a novel method to improve the dereverberation performance of the front-end network using phone-class information. At the front-end, we adopt a deep autoencoder (DAE) for enhancing the speech feature parameters, and speech recognition is performed in the back-end using DNN-HMM acoustic models trained on multi-condition data. The system was evaluated through the ASR task in the Reverb Challenge 2014. The DNN-HMM system trained on the multi-condition training set achieved a conspicuously higher word accuracy compared to the MLLR-adapted GMM-HMM system trained on the same data. Furthermore, feature enhancement with the deep autoencoder contributed to the improvement of recognition accuracy especially in the more adverse conditions. While the mapping between reverberant and clean speech in DAE-based dereverberation is conventionally conducted only with the acoustic information, we presume the mapping is also dependent on the phone information. Therefore, we propose a new scheme (pDAE), which augments a phone-class feature to the standard acoustic features as input. Two types of the phone-class feature are investigated. One is the hard recognition result of monophones, and the other is a soft representation derived from the posterior outputs of monophone DNN. The augmented feature in either type results in a significant improvement (7-8 % relative) from the standard DAE.
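
    A toy PyTorch rendering of the pDAE input scheme: the reverberant feature vector is concatenated with a phone-class vector (here a soft monophone posterior) and the network is trained to output the clean features. All dimensions, data and layer sizes are illustrative, not the paper's configuration.

```python
# Illustrative pDAE-style enhancement network; shapes and data are toy values.
import torch
import torch.nn as nn

FEAT, PHONE = 40, 40                 # acoustic dims / phone classes (assumed)

class PDAE(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT + PHONE, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, FEAT))          # enhanced (dereverberated) features

    def forward(self, reverb_feat, phone_post):
        # Augment the acoustic input with the phone-class feature.
        return self.net(torch.cat([reverb_feat, phone_post], dim=-1))

model = PDAE()
reverb = torch.randn(8, FEAT)                            # reverberant frames
phones = torch.softmax(torch.randn(8, PHONE), dim=-1)    # soft phone posteriors
clean = torch.randn(8, FEAT)                             # paired clean targets
loss = nn.functional.mse_loss(model(reverb, phones), clean)
loss.backward()
print(float(loss))
```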

  14. Speech-enabled Computer-aided Translation

    DEFF Research Database (Denmark)

    Mesa-Lao, Bartolomé

    2014-01-01

    The present study has surveyed post-editor trainees' views and attitudes before and after the introduction of speech technology as a front end to a computer-aided translation workbench. The aim of the survey was (i) to identify attitudes and perceptions among post-editor trainees before performing a post-editing task using automatic speech recognition (ASR); and (ii) to assess the degree to which post-editors' attitudes and expectations to the use of speech technology changed after actually using it. The survey was based on two questionnaires: the first one administered before the...

  15. Direcionalidade e reconhecimento de fala no ruído: estudo de quatro casos Directionality and speech recognition in noise: study of four cases

    Directory of Open Access Journals (Sweden)

    Mariane Turella Mazzochi

    2012-01-01

    Full Text Available The difficulty of understanding speech in noise is reported as one of the main disabilities by hearing aid users. The aim of this study was to compare the auditory performance of subjects with mild to moderate bilateral sensorineural hearing loss using the omnidirectional, fixed directional and automatically activated adaptive directional microphones, by means of the signal-to-noise ratios (S/N) at which the Sentence Recognition Thresholds in Noise (SRTN) are obtained. Beltone Reach hearing aids, model RCH62, were used in the omnidirectional, fixed directional and automatically activated adaptive directional microphone modes. The following acoustic stimulus presentation conditions were tested: speech at 0° azimuth and noise at 180° azimuth (0°/180°), speech at 90° azimuth and noise at 270° azimuth (90°/270°), and speech at 270° azimuth and noise at 90° azimuth (270°/90°). The mean signal-to-noise ratios ranged from 6.6 dB to -6.9 dB. The microphone with the best mean signal-to-noise ratio across the three stimulus presentation conditions was the automatically activated adaptive directional microphone. However, as this was a small sample, there was large individual variability. Further studies should be carried out to provide scientific support for the selection of the most appropriate microphones.

  16. The accuracy of radiology speech recognition reports in a multilingual South African teaching hospital

    International Nuclear Information System (INIS)

    Speech recognition (SR) technology, the process whereby spoken words are converted to digital text, has been used in radiology reporting since 1981. It was initially anticipated that SR would dominate radiology reporting, with claims of up to 99% accuracy, reduced turnaround times and significant cost savings. However, expectations have not yet been realised. The limited data available suggest SR reports have significantly higher levels of inaccuracy than traditional dictation transcription (DT) reports, as well as incurring greater aggregate costs. There has been little work on the clinical significance of such errors, however, and little is known of the impact of reporter seniority on the generation of errors, or the influence of system familiarity on reducing error rates. Furthermore, there have been conflicting findings on the accuracy of SR amongst users with English as first- and second-language respectively. The aim of the study was to compare the accuracy of SR and DT reports in a resource-limited setting. The first 300 SR and the first 300 DT reports generated during March 2010 were retrieved from the hospital’s PACS, and reviewed by a single observer. Text errors were identified, and then classified as either clinically significant or insignificant based on their potential impact on patient management. In addition, a follow-up analysis was conducted exactly 4 years later. Of the original 300 SR reports analysed, 25.6% contained errors, with 9.6% being clinically significant. Only 9.3% of the DT reports contained errors, 2.3% having potential clinical impact. Both the overall difference in SR and DT error rates, and the difference in ‘clinically significant’ error rates (9.6% vs. 2.3%) were statistically significant. In the follow-up study, the overall SR error rate was strikingly similar at 24.3%, 6% being clinically significant. Radiologists with second-language English were more likely to generate reports containing errors, but level of seniority

  17. Impact of a PACS/RIS-integrated speech recognition system on radiology reporting time and report availability

    International Nuclear Information System (INIS)

    Purpose: To quantify the impact of a PACS/RIS-integrated speech recognition system (SRS) on the time spent on radiology reporting and on hospital-wide report availability (RA) in a university institution. Material and Methods: In a prospective pilot study, the following parameters were assessed for 669 radiographic examinations (CR): 1. the time required per report dictation (TED: dictation time (s)/number of images [examination] x number of words [report]) with either a combination of PACS and tape-based dictation (TD: analog dictation device/minicassette/transcription) or PACS/RIS/speech recognition system (RR: remote recognition/transcription; OR: online recognition/self-correction by radiologist), and 2. the report turnaround time (RTT) as the time interval from the entry of the first image into the PACS to the available RIS/HIS report. Two equal time periods were chosen retrospectively from the RIS database: 11/2002-2/2003 (only TD) and 11/2003-2/2004 (only RR or OR with the SRS). The mid-term (≥24 h, 24 h intervals) and short-term (<24 h, 1 h intervals) RA after examination completion were calculated for all modalities and for CR, CT, MR and XA/DS separately. The relative increase in the mid-term RA (RIMRA: related to the total number of examinations in each time period) and the increase in the short-term RA (ISRA: ratio of available reports during the 1st to 24th hour) were calculated. Results: Prospectively, there was a significant difference between TD/RR/OR (n=151/257/261) regarding mean TED (0.44/0.54/0.62 s [per word and image]) and mean RTT (10.47/6.65/1.27 h), respectively. Retrospectively, 37,898/39,680 reports were computed from the RIS database for the time periods 11/2002-2/2003 and 11/2003-2/2004. For CR/CT there was a shift of the short-term RA to the first 6 hours after examination completion (mean cumulative RA 20% higher), with a more than three-fold increase in the total number of available

  18. Towards Robust Visual Speech Recognition: Automatic Systems for Lip Reading of Dutch

    NARCIS (Netherlands)

    Chitu, A.G.

    2010-01-01

    In the last two decades we witnessed a rapid increase of computational power governed by Moore's Law. As a side effect, the affordability of cheaper and faster CPUs increased as well. Therefore, many new “smart” devices flooded the market and made information systems widespread. The number

  19. APPLYING RECOGNITION OF EMOTIONS IN SPEECH TO EXTEND THE IMPACT OF BRAND SLOGAN RESEARCH

    OpenAIRE

    Chien, Charles S.; Wan-Chen, Wang; Moutinho, Luiz; Cheng, Yun-Maw; Pao, Tsang-Long; Yu-Te, Chen; Jun-Heng, Yeh

    2007-01-01

    How brand slogans can influence and change consumers' perception of the image of products has been a topic of great interest to marketers. However, it is a non-trivial task to evaluate how brand slogans affect customers' emotions and how those emotions influence the customers' perceptions of brand images. In this paper we demonstrate the Slogan Validator, which evaluates brand slogans by analyzing the speech signals of customers' voiced slogans. It is arguably the first speech signal based a...

  20. Speech Enhancement and Recognition in Meetings with an Audio-Visual Sensor Array

    OpenAIRE

    Maganti, Hari Krishna; Gatica-Perez, Daniel; McCowan, Iain A.

    2006-01-01

    We address the problem of distant speech acquisition in multi-party meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. Beamforming techniques rely on the knowledge of a speaker location. In this paper, we present an integrated approach, in which an audio-visual multi-person tracker is used to track active ...