Face-to-face communication involves both hearing and seeing speech. Heard and seen speech inputs interact during audiovisual speech perception. Specifically, seeing the speaker's mouth and lip movements improves identification of acoustic speech stimuli, especially in noisy conditions. In addition, visual speech may even change the auditory percept. This occurs when mismatching auditory speech is dubbed onto visual articulation. Research on the brain mechanisms of audiovisual perception a...
Peelle, Jonathan E; Sommers, Mitchell S
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to acoustic information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration.
Audiovisual information is integrated in speech perception. One manifestation of this is the McGurk illusion, in which watching the articulating face alters the auditory phonetic percept. Understanding this phenomenon fully requires a computational model with predictive power. Here, we describe … ordinal models that can account for the McGurk illusion. We compare this type of model to the Fuzzy Logical Model of Perception (FLMP), in which the response categories are not ordered. While the FLMP generally fit the data better than the ordinal model, it also employs more free parameters in complex … experiments when the number of response categories is high, as it is for speech perception in general. Testing the predictive power of the models using a form of cross-validation, we found that ordinal models perform better than the FLMP. Based on these findings we suggest that ordinal models generally have…
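For orientation, the FLMP referenced above combines the unimodal category supports multiplicatively and normalizes over the response alternatives. A minimal sketch of that computation, with all support values hypothetical (the ordinal models themselves are not specified in this excerpt):

```python
import numpy as np

def flmp_predict(a, v):
    """FLMP response probabilities: normalized product of the unimodal
    auditory and visual supports for each response category."""
    s = a * v
    return s / s.sum()

# Hypothetical supports for three categories (/b/, /d/, /g/) given an
# incongruent audiovisual stimulus: audition favors /b/, vision favors /g/.
a = np.array([0.70, 0.25, 0.05])
v = np.array([0.05, 0.25, 0.70])
print(flmp_predict(a, v))  # the intermediate /d/ wins, McGurk-style
```

The multiplicative rule lets an intermediate category dominate even though neither modality favors it outright, which is how the FLMP accounts for fusion responses.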
Eskelund, Kasper; Dau, Torsten
Speech perception integrates signal from ear and eye. This is witnessed by a wide range of audiovisual integration effects, such as ventriloquism and the McGurk illusion. Some behavioral evidence suggests that audiovisual integration of specific aspects is special for speech perception. However, our… …mismatch negativity response (MMN). MMN has the property of being evoked when an acoustic stimulus deviates from a learned pattern of stimuli. In three experimental studies, this effect is utilized to track when a coinciding visual signal alters auditory speech perception. Visual speech emanates from the… …auditory speech percept? In two experiments, which both combine behavioral and neurophysiological measures, an uncovering of the relation between perception of faces and of audiovisual integration is attempted. Behavioral findings suggest a strong effect of face perception, whereas the MMN results are less…
Andersen, Tobias; Tiippana, K.; Laarni, J.; Kojo, I.; Sams, M.
Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive, but recent reports have challenged this view. Here we study the effect of visual spatial attention on the McGurk effect. By presenting a movie of two faces symmetrically displaced to each side of a central fixation point and dubbed with a single auditory speech track, we were able to discern the influences from each of the faces and from the voice on the auditory speech percept. We found that directing visual spatial attention towards a face increased the influence of that face on auditory perception. However, the influence of the voice on auditory perception did not change, suggesting that audiovisual…
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: (1) how do visual cues from a speaker's face to emotion relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests…
Eskelund, Kasper; Andersen, Tobias
Speech perception is audiovisual as evidenced by the McGurk effect in which watching incongruent articulatory mouth movements can change the phonetic auditory speech percept. This type of audiovisual integration may be specific to speech or be applied to all stimuli in general. To investigate this… …audiovisual integration specific to speech perception. However, the results of Tuomainen et al. might have been influenced by another effect. When observers were naïve, they had little motivation to look at the face. When informed, they knew that the face was relevant for the task and this could increase… …noise were measured for naïve and informed participants. We found that the threshold for detecting speech in audiovisual stimuli was lower than for auditory-only stimuli. But there was no detection advantage for observers informed of the speech nature of the auditory signal. This may indicate that…
Gentilucci, Maurizio; Cattaneo, Luigi
Two experiments aimed to determine whether features of both the visual and acoustical inputs are always merged into the perceived representation of speech and whether this audiovisual integration is based on either cross-modal binding functions or on imitation. In a McGurk paradigm, observers were required to repeat aloud a string of phonemes uttered by an actor (acoustical presentation of phonemic string) whose mouth, in contrast, mimicked pronunciation of a different string (visual presentation). In a control experiment participants read the same printed strings of letters. This condition aimed to analyze the pattern of voice and the lip kinematics controlling for imitation. In the control experiment and in the congruent audiovisual presentation, i.e. when the articulation mouth gestures were congruent with the emission of the string of phones, the voice spectrum and the lip kinematics varied according to the pronounced strings of phonemes. In the McGurk paradigm the participants were unaware of the incongruence between visual and acoustical stimuli. The acoustical analysis of the participants' spoken responses showed three distinct patterns: the fusion of the two stimuli (the McGurk effect), repetition of the acoustically presented string of phonemes, and, less frequently, of the string of phonemes corresponding to the mouth gestures mimicked by the actor. However, the analysis of the latter two responses showed that the formant 2 of the participants' voice spectra always differed from the value recorded in the congruent audiovisual presentation. It approached the value of the formant 2 of the string of phonemes presented in the other modality, which was apparently ignored. The lip kinematics of the participants repeating the string of phonemes acoustically presented were influenced by the observation of the lip movements mimicked by the actor, but only when pronouncing a labial consonant. The data are discussed in favor of the hypothesis that features of both
To take a step towards real-life-like experimental setups, we simultaneously recorded magnetoencephalographic (MEG) signals and subjects' gaze direction during audiovisual speech perception. The stimuli were utterances of /apa/ dubbed onto two side-by-side female faces articulating /apa/ (congruent) and /aka/ (incongruent) in synchrony, repeated once every 3 s. Subjects (N = 10) were free to decide which face they viewed, and responses were averaged to two categories according to the gaze direction. The right-hemisphere 100-ms response to the onset of the second vowel (N100m') was a fifth smaller to incongruent than congruent stimuli. The results demonstrate the feasibility of realistic viewing conditions with gaze-based averaging of MEG signals.
Meronen, Auli; Tiippana, Kaisa; Westerholm, Jari; Ahonen, Timo
Purpose: The effect of the signal-to-noise ratio (SNR) on the perception of audiovisual speech in children with and without developmental language disorder (DLD) was investigated by varying the noise level and the sound intensity of acoustic speech. The main hypotheses were that the McGurk effect (in which incongruent visual speech alters the…
Buchan, Julie; Paré, Martin; Yurick, Micheal; Munhall, Kevin
In natural conversation, visual and auditory information about speech not only provide linguistic information but also provide information about the identity and the emotional state of the speaker. Thus, listeners must process a wide range of information in parallel to understand the full meaning in a message. In this series of studies, we examined how different types of visual information conveyed by a speaker's face are processed by measuring the gaze patterns exhibited by subjects watching audiovisual recordings of spoken sentences. In three experiments, subjects were asked to judge the emotion and the identity of the speaker, and to report the words that they heard under different auditory conditions. As in previous studies, eye and mouth regions dominated the distribution of the gaze fixations. It was hypothesized that the eyes would attract more fixations for more social judgment tasks, rather than tasks which rely more on verbal comprehension. Our results support this hypothesis. In addition, the location of gaze on the face did not influence the accuracy of the perception of speech in noise.
Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs) generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e. a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Knowland, Victoria C. P.; Mercure, Evelyne; Karmiloff-Smith, Annette; Dick, Fred; Thomas, Michael S. C.
Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly surprising given the potential utility of multiple informational cues during language…
Eskelund, Kasper; Andersen, Tobias
…observers might only have been motivated to look at the face when informed, and audio and video thus seemed related. Since Tuomainen et al. did not control for this, the influence of motivation is unknown. The current experiment repeated the original methods while controlling eye movements. 4 observers… …observers did look near the mouth. We conclude that eye movements did not influence the results of Tuomainen et al. and that their results thus can be taken as evidence of a speech-specific mode of audiovisual integration underlying the McGurk illusion.
Altvater-Mackensen, Nicole; Mani, Nivedita; Grossmann, Tobias
Recent studies suggest that infants' audiovisual speech perception is influenced by articulatory experience (Mugitani et al., 2008; Yeung & Werker, 2013). The current study extends these findings by testing if infants' emerging ability to produce native sounds in babbling impacts their audiovisual speech perception. We tested 44 6-month-olds…
Venezia, Jonathan H; Thurman, Steven M; Matchin, William; George, Sahara E; Hickok, Gregory
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (~35 % identification of /apa/ compared to ~5 % in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content. PMID:26669309
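The masking-based classification logic lends itself to a compact illustration. Below is a hedged sketch with simulated data and a single temporal dimension (the study derived full spatiotemporal maps; the trial and frame counts here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_frames = 2000, 30
masks = rng.random((n_trials, n_frames))      # per-frame mouth visibility, 0-1
# Simulate observers who report /apa/ less often when frames 10-14 are visible.
p_apa = 0.35 - 0.25 * masks[:, 10:15].mean(1)
resp = rng.random(n_trials) < p_apa           # True = identified /apa/

# Classification map in time: the visibility difference between response
# classes marks the frames whose visual information drove perception.
cmap = masks[resp].mean(0) - masks[~resp].mean(0)
print(cmap.round(2))                          # dips around the informative frames
```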
Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk−MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely focused on the fuzzy logical model of perception (FLMP), which provides excellent fits to experimental observations but also has been criticized for being too flexible, post hoc and difficult to interpret. The current study introduces the early maximum likelihood estimation (MLE) model of audiovisual integration to speech perception along with three model variations. In early MLE, integration is based on a continuous internal representation before categorization, which can make the model more parsimonious by imposing constraints that reflect experimental designs. The study also shows that cross…
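The core of MLE cue combination, on which the early MLE model builds, is reliability-weighted averaging on the continuous internal representation before categorization. A minimal sketch with toy values (the paper's three model variations are not reproduced here):

```python
def mle_fuse(x_a, var_a, x_v, var_v):
    """Maximum-likelihood fusion of two noisy cues: a reliability-weighted
    average whose variance is lower than either unimodal variance."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    x_av = w_a * x_a + (1 - w_a) * x_v
    var_av = (var_a * var_v) / (var_a + var_v)
    return x_av, var_av

# Hypothetical internal values for an incongruent syllable: the more reliable
# visual cue (smaller variance) pulls the fused estimate toward itself.
print(mle_fuse(x_a=0.0, var_a=1.0, x_v=1.0, var_v=0.25))  # -> (0.8, 0.2)
```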
Kim, Heejung; Hahm, Jarang; Lee, Hyekyoung; Kang, Eunjoo; Kang, Hyejin; Lee, Dong Soo
The human brain naturally integrates audiovisual information to improve speech perception. However, in noisy environments, understanding speech is difficult and may require much effort. Although the brain network is supposed to be engaged in speech perception, it is unclear how speech-related brain regions are connected during natural bimodal audiovisual or unimodal speech perception with counterpart irrelevant noise. To investigate the topological changes of speech-related brain networks at all possible thresholds, we used a persistent homological framework through hierarchical clustering, such as single linkage distance, to analyze the connected components of the functional network during speech perception using functional magnetic resonance imaging. For speech perception, bimodal (audio-visual speech cue) or unimodal speech cues with counterpart irrelevant noise (auditory white noise or visual gum-chewing) were delivered to 15 subjects. In terms of positive relationships, similar connected components were observed in the bimodal and unimodal speech conditions during filtration. However, during speech perception with congruent audiovisual stimuli, tighter coupling of a left anterior temporal gyrus–anterior insula component and of right premotor–visual components was observed than in the auditory or visual speech cue conditions, respectively. Interestingly, visual speech perceived under white noise showed tight negative coupling among the left inferior frontal region, right anterior cingulate, left anterior insula, and bilateral visual regions, including right middle temporal gyrus and right fusiform components. In conclusion, the speech brain network is tightly positively or negatively connected, reflecting efficient or effortful processes during natural audiovisual integration or lip-reading, respectively, in speech perception. PMID:25495216
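The filtration described above, tracking connected components of the functional network across all thresholds, can be sketched with standard hierarchical-clustering tools; the region count and data below are simulated stand-ins for the fMRI correlation matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
ts = rng.standard_normal((12, 100))        # 12 toy "regions" x 100 time points
dist = 1 - np.corrcoef(ts)                 # correlation -> distance
condensed = dist[np.triu_indices(12, k=1)]

Z = linkage(condensed, method='single')    # single-linkage dendrogram
for thr in (0.6, 0.8, 1.0, 1.2):           # sweep the filtration thresholds
    n = fcluster(Z, t=thr, criterion='distance').max()
    print(f'threshold {thr}: {n} connected component(s)')
```

Persistence here amounts to recording how long each component survives as the threshold grows, which is what distinguishes tightly from loosely coupled subnetworks.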
Gender and age have been found to affect adults' audio-visual (AV) speech perception. However, research on adult aging focuses on adults over 60 years, who have an increasing likelihood of cognitive and sensory decline, which may confound positive effects of age-related AV experience and its interaction with gender. Observed age and gender differences in AV speech perception may also depend on measurement sensitivity and AV task difficulty. Consequently, both AV benefit and visual influence were used to measure visual contribution for gender-balanced groups of young (20–30 years) and middle-aged (50–60 years) adults, with task difficulty varied using AV syllables from different talkers in alternative auditory backgrounds. Females had better speech-reading performance than males. Whereas no gender differences in AV benefit or visual influence were observed for young adults, visually influenced responses were significantly greater for middle-aged females than middle-aged males. That speech-reading performance did not influence AV benefit may be explained by visual speech extraction and AV integration constituting independent abilities. Contrastingly, the gender difference in visually influenced responses in middle adulthood may reflect an experience-related shift in females' general AV perceptual strategy. Although young females' speech-reading proficiency may not readily contribute to greater visual influence, between young and middle adulthood recurrent confirmation of the contribution of visual cues, induced by speech-reading proficiency, may gradually shift females' AV perceptual strategy towards more visually dominated responses.
Speech perception under audiovisual conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how audiovisual training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures in a protocol with a fixed number of trials. In Experiment 1, paired-associates (PA) audiovisual (AV) training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called "reverse hierarchy theory" of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early audiovisual speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning.
Previous studies have found that perception in older people benefits from multisensory over uni-sensory information. As normal speech recognition is affected by both the auditory input and the visual lip-movements of the speaker, we investigated the efficiency of audio and visual integration in an older population by manipulating the relative reliability of the auditory and visual information in speech. We also investigated the role of the semantic context of the sentence to assess whether audio-visual integration is affected by top-down semantic processing. We presented participants with audio-visual sentences in which the visual component was either blurred or not blurred. We found that there was a greater cost in recall performance for semantically meaningless speech in the audio-visual blur compared to audio-visual no blur condition and this effect was specific to the older group. Our findings have implications for understanding how aging affects efficient multisensory integration for the perception of speech and suggest that multisensory inputs may benefit speech perception in older adults when the semantic content of the speech is unpredictable.
Background and Aim: Neuroimaging techniques offer an innovative approach to investigating audiovisual speech processing, enabling detailed study of the mechanisms involved. The purpose of this study was to evaluate brain activity via functional magnetic resonance imaging during audiovisual speech perception in the Persian language. Methods: Functional MRI was used to assess 19 healthy 20–30-year-old women who were presented with the syllable /ka/ visually and /pa/ auditorily in a block-design paradigm, providing two series of images, functional and T1-weighted. The results were then analyzed and compared using FSL software. Results: Both middle and cortical regions of the brain were activated by visual stimuli, and middle regions were activated in response to auditory stimuli. Specifically, the left anterior supramarginal gyrus and parts of the motor speech system, including the insular and cingulate cortex and precentral cortex, were stimulated by the visual stimulus, whereas the left posterior supramarginal gyrus as well as the right supramarginal gyrus were stimulated by the auditory stimulus. Moreover, the McGurk effect was behaviorally demonstrated in fifteen subjects. Conclusion: The activation of a single region, the supramarginal gyrus, by both auditory and visual stimuli suggests a common region for phonological processing of sensory inputs. In addition, auditory stimuli produced more intense activity, whereas broader activation (maximum voxel) and additional regions were observed in response to visual stimuli. These points reflect the unfamiliarity of the normal brain with perceiving visual speech stimuli.
(1) To evaluate the recognition of words, phonemes and lexical tones in audiovisual (AV) and auditory-only (AO) modes in Mandarin-speaking adults with cochlear implants (CIs); (2) to understand the effect of presentation levels on AV speech perception; (3) to learn the effect of hearing experience on AV speech perception. Thirteen deaf adults (age = 29.1 ± 13.5 years; 8 male, 5 female) who had used CIs for >6 months and 10 normal-hearing (NH) adults participated in this study. Seven of them were prelingually deaf, and 6 postlingually deaf. The Mandarin Monosyllabic Word Recognition Test was used to assess recognition of words, phonemes and lexical tones in AV and AO conditions at 3 presentation levels: speech detection threshold (SDT), speech recognition threshold (SRT) and 10 dB SL (re: SRT). The prelingual group had better phoneme recognition in the AV mode than in the AO mode at SDT and SRT (both p = 0.016), and so did the NH group at SDT (p = 0.004). Mode difference was not noted in the postlingual group. None of the groups had significantly different tone recognition in the 2 modes. The prelingual and postlingual groups had significantly better phoneme and tone recognition than the NH one at SDT in the AO mode (p = 0.016 and p = 0.002 for phonemes; p = 0.001 and p < 0.001 for tones) but were outperformed by the NH group at 10 dB SL (re: SRT) in both modes (both p < 0.001 for phonemes; p < 0.001 and p = 0.002 for tones). The recognition scores had a significant correlation with group, with age and sex controlled (p < 0.001). Visual input may help prelingually deaf implantees to recognize phonemes but may not augment Mandarin tone recognition. The effect of presentation level seems minimal on CI users' AV perception. This indicates special considerations in developing audiological assessment protocols and rehabilitation strategies for implantees who speak tonal languages.
Williams, Joshua T; Darcy, Isabelle; Newman, Sharlene D
The aim of the present study was to characterize effects of learning a sign language on the processing of a spoken language. Specifically, audiovisual phoneme comprehension was assessed before and after 13 weeks of sign language exposure. L2 American Sign Language (ASL) learners performed this task in the fMRI scanner. Results indicated that L2 ASL learners' behavioral classification of the speech sounds improved with time compared to hearing nonsigners. Results indicated increased activation in the supramarginal gyrus (SMG) after sign language exposure, which suggests concomitant increased phonological processing of speech. A multiple regression analysis indicated that learners' ratings of co-sign speech use and lipreading ability were correlated with SMG activation. This pattern of results indicates that the increased use of mouthing and possibly lipreading during sign language acquisition may concurrently improve audiovisual speech processing in budding hearing bimodal bilinguals. PMID:26740404
Stevenson, Ryan A.; Siemann, Justin K.; Woynaroski, Tiffany G.; Schneider, Brittany C.; Eberly, Haley E.; Camarata, Stephen M.; Wallace, Mark T.
Atypical communicative abilities are a core marker of Autism Spectrum Disorders (ASD). A number of studies have shown that, in addition to auditory comprehension differences, individuals with autism frequently show atypical responses to audiovisual speech, suggesting a multisensory contribution to these communicative differences from their…
Thomas, Sharon M.; Jordan, Timothy R.
Seeing a talker's face influences auditory speech recognition, but the visible input essential for this influence has yet to be established. Using a new seamless editing technique, the authors examined effects of restricting visible movement to oral or extraoral areas of a talking face. In Experiment 1, visual speech identification and visual…
…visual lip features is used. Phoneme-related receptive fields result on the SOM basis; they are speaker-dependent and show individual locations and strain. Overlapping main slopes indicate a high similarity of respective units; distortion or extra peaks originate from the influence of other units. Dependent on the training data, these other units may also be contextually immediate neighboring units. The poster demonstrates the idea with text material spoken by one individual subject using a set of simple audio-visual features. The data material for the training process consists of 44 labeled…
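A self-organizing map of the kind described can be trained in a few lines; the sketch below uses random stand-ins for the audio-visual feature vectors (grid size, learning rate and neighborhood width are arbitrary, and the usual decay schedules are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 8))            # hypothetical audio-visual feature vectors
som = rng.random((10, 10, 8))          # 10 x 10 map, one codebook vector per unit
ii, jj = np.mgrid[0:10, 0:10]

for _ in range(2000):                  # plain online SOM training
    x = data[rng.integers(len(data))]
    d = ((som - x) ** 2).sum(-1)
    bi, bj = np.unravel_index(d.argmin(), d.shape)   # best-matching unit
    h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * 2.0 ** 2))
    som += 0.1 * h[..., None] * (x - som)            # pull neighborhood toward x
```

Labeling each unit by the phoneme that best activates it would yield the phoneme-related receptive fields the abstract refers to.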
Megnin-Viggars, Odette; Goswami, Usha
Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and…
Martin, Jean-Remy; Kösem, Anne; van Wassenhove, Virginie
The effect of stimulation history on the perception of a current event can yield two opposite effects, namely: adaptation or hysteresis. The perception of the current event thus goes in the opposite or in the same direction as prior stimulation, respectively. In audiovisual (AV) synchrony perception, adaptation effects have primarily been reported. Here, we tested if perceptual hysteresis could also be observed over adaptation in AV timing perception by varying different experimental conditio...
Klitsch, Julia Ulrike
This dissertation investigates speech perception in three different groups of native adult speakers of Dutch: an aphasic group and two age-varying control groups. By means of two different experiments, it is examined whether the availability of visual articulatory information is beneficial to the auditory speech…
Traditionally, second language (L2) instruction has emphasised auditory-based instruction methods. However, this approach is restrictive in the sense that speech perception by humans is not just an auditory phenomenon but a multimodal one, and specifically, a visual one as well. In the past decade, experimental studies have shown that the…
D'Souza, Dean; D'Souza, Hana; Johnson, Mark H; Karmiloff-Smith, Annette
Typically-developing (TD) infants can construct unified cross-modal percepts, such as a speaking face, by integrating auditory-visual (AV) information. This skill is a key building block upon which higher-level skills, such as word learning, are built. Because word learning is seriously delayed in most children with neurodevelopmental disorders, we assessed the hypothesis that this delay partly results from a deficit in integrating AV speech cues. AV speech integration has rarely been investigated in neurodevelopmental disorders, and never previously in infants. We probed for the McGurk effect, which occurs when the auditory component of one sound (/ba/) is paired with the visual component of another sound (/ga/), leading to the perception of an illusory third sound (/da/ or /tha/). We measured AV integration in 95 infants/toddlers with Down, fragile X, or Williams syndrome, whom we matched on Chronological and Mental Age to 25 TD infants. We also assessed a more basic AV perceptual ability: sensitivity to matching vs. mismatching AV speech stimuli. Infants with Williams syndrome failed to demonstrate a McGurk effect, indicating poor AV speech integration. Moreover, while the TD children discriminated between matching and mismatching AV stimuli, none of the other groups did, hinting at a basic deficit or delay in AV speech processing, which is likely to constrain subsequent language development. PMID:27498221
Van der Burg, Erik; Goodbourn, Patrick T
The brain is adaptive. The speed of propagation through air, and of low-level sensory processing, differs markedly between auditory and visual stimuli; yet the brain can adapt to compensate for the resulting cross-modal delays. Studies investigating temporal recalibration to audiovisual speech have used prolonged adaptation procedures, suggesting that adaptation is sluggish. Here, we show that adaptation to asynchronous audiovisual speech occurs rapidly. Participants viewed a brief clip of an actor pronouncing a single syllable. The voice was either advanced or delayed relative to the corresponding lip movements, and participants were asked to make a synchrony judgement. Although we did not use an explicit adaptation procedure, we demonstrate rapid recalibration based on a single audiovisual event. We find that the point of subjective simultaneity on each trial is highly contingent upon the modality order of the preceding trial. We find compelling evidence that rapid recalibration generalizes across different stimuli, and different actors. Finally, we demonstrate that rapid recalibration occurs even when auditory and visual events clearly belong to different actors. These results suggest that rapid temporal recalibration to audiovisual speech is primarily mediated by basic temporal factors, rather than higher-order factors such as perceived simultaneity and source identity. PMID:25716790
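The key analysis here, conditioning each trial's synchrony judgment on the modality order of the preceding trial, can be sketched as follows. The data are simulated, and a crude centroid estimate stands in for proper psychometric fitting of the point of subjective simultaneity (PSS):

```python
import numpy as np

rng = np.random.default_rng(0)
soas = np.array([-240, -120, -60, 0, 60, 120, 240])    # ms, positive = auditory lead
soa = rng.choice(soas, 4000)                           # hypothetical trial sequence
prev_audio_led = np.roll(soa, 1) > 0
shift = np.where(prev_audio_led, 40, -40)              # simulated PSS drift toward
p_sync = np.exp(-((soa - shift) / 120.0) ** 2)         # the preceding trial's order
judged = rng.random(4000) < p_sync                     # simulated "in sync" reports

for name, sel in (('after audio-lead', prev_audio_led),
                  ('after video-lead', ~prev_audio_led)):
    p = np.array([judged[(soa == s) & sel].mean() for s in soas])
    pss = (soas * p).sum() / p.sum()                   # centroid estimate of the PSS
    print(f'{name}: PSS ~ {pss:.0f} ms')
```

The signature of rapid recalibration is that the two conditioned PSS estimates differ in sign, mirroring the single-trial contingency the paper reports.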
Alsius, Agnès; Navarra, Jordi; Campbell, Ruth; Soto-Faraco, Salvador
One of the most commonly cited examples of human multisensory integration occurs during exposure to natural speech, when the vocal and the visual aspects of the signal are integrated in a unitary percept. Audiovisual association of facial gestures and vocal sounds has been demonstrated in nonhuman primates and in prelinguistic children, arguing for a general basis for this capacity. One critical question, however, concerns the role of attention in such multisensory integration. Although both behavioral and neurophysiological studies have converged on a preattentive conceptualization of audiovisual speech integration, this mechanism has rarely been measured under conditions of high attentional load, when the observers' attention resources are depleted. We tested the extent to which audiovisual integration was modulated by the amount of available attentional resources by measuring the observers' susceptibility to the classic McGurk illusion in a dual-task paradigm. The proportion of visually influenced responses was severely, and selectively, reduced if participants were concurrently performing an unrelated visual or auditory task. In contrast with the assumption that crossmodal speech integration is automatic, our results suggest that these multisensory binding processes are subject to attentional demands. PMID:15886102
Andersen, Tobias Søren; Starrfelt, Randi
Lesions to Broca’s area cause aphasia characterised by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca’s area is also involved in speech perception. While these studies have focused on auditory speech perception other studies have shown that Broca’s area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca’s aphasia did not experience the McGurk illusion suggesting that an intact Broca’s area is necessary for audiovisual integration of speech. Here we describe a patient with Broca’s aphasia who experienced the McGurk illusion. This indicates that an intact Broca’s area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical, which could be due to Broca’s area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke’s aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing that are not specific to Broca’s aphasia.
Hillock-Dunn, Andrea; Grantham, D Wesley; Wallace, Mark T
During a typical communication exchange, both auditory and visual cues contribute to speech comprehension. The influence of vision on speech perception can be measured behaviorally using a task where incongruent auditory and visual speech stimuli are paired to induce perception of a novel token reflective of multisensory integration (i.e., the McGurk effect). This effect is temporally constrained in adults, with illusion perception decreasing as the temporal offset between the auditory and visual stimuli increases. Here, we used the McGurk effect to investigate the development of the temporal characteristics of audiovisual speech binding in 7-24 year-olds. Surprisingly, results indicated that although older participants perceived the McGurk illusion more frequently, no age-dependent change in the temporal boundaries of audiovisual speech binding was observed. PMID:26920938
D'Ausilio, Alessandro; Bartoli, Eleonora; Maffongelli, Laura; Berry, Jeffrey James; Fadiga, Luciano
Audiovisual speech perception is likely based on the association between auditory and visual information into stable audiovisual maps. Conflicting audiovisual inputs generate perceptual illusions such as the McGurk effect. Audiovisual mismatch effects could be either driven by the detection of violations in the standard audiovisual statistics or via the sensorimotor reconstruction of the distal articulatory event that generated the audiovisual ambiguity. In order to disambiguate between the two hypotheses we exploit the fact that the tongue is hidden to vision. For this reason, tongue movement encoding can solely be learned via speech production but not via others' speech perception alone. Here we asked participants to identify speech sounds while viewing matching or mismatching visual representations of tongue movements. Vision of congruent tongue movements facilitated auditory speech identification with respect to incongruent trials. This result suggests that direct visual experience of an articulator movement is not necessary for the generation of audiovisual mismatch effects. Furthermore, we suggest that audiovisual integration in speech may benefit from speech production learning. PMID:25172391
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
…signal. Here we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers…
Crosse, Michael J; Lalor, Edmund C
Visual speech can greatly enhance a listener's comprehension of auditory speech when they are presented simultaneously. Efforts to determine the neural underpinnings of this phenomenon have been hampered by the limited temporal resolution of hemodynamic imaging and the fact that EEG and magnetoencephalographic data are usually analyzed in response to simple, discrete stimuli. Recent research has shown that neuronal activity in human auditory cortex tracks the envelope of natural speech. Here, we exploit this finding by estimating a linear forward-mapping between the speech envelope and EEG data and show that the latency at which the envelope of natural speech is represented in cortex is shortened by >10 ms when continuous audiovisual speech is presented compared with audio-only speech. In addition, we use a reverse-mapping approach to reconstruct an estimate of the speech stimulus from the EEG data and, by comparing the bimodal estimate with the sum of the unimodal estimates, find no evidence of any nonlinear additive effects in the audiovisual speech condition. These findings point to an underlying mechanism that could account for enhanced comprehension during audiovisual speech. Specifically, we hypothesize that low-level acoustic features that are temporally coherent with the preceding visual stream may be synthesized into a speech object at an earlier latency, which may provide an extended period of low-level processing before extraction of semantic information. PMID:24401714
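The forward-mapping idea, estimating a linear kernel from the speech envelope to the EEG, reduces to time-lagged ridge regression. A hedged sketch with a simulated envelope and a single EEG channel (the study's actual estimator, channel count and preprocessing differ):

```python
import numpy as np

def trf_forward(env, eeg, lags, lam=1e2):
    """Forward model: ridge regression from time-lagged copies of the speech
    envelope onto one EEG channel, one weight per lag (circular shifts are
    used for brevity; real pipelines pad the edges instead)."""
    X = np.stack([np.roll(env, l) for l in lags], axis=1)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)

rng = np.random.default_rng(0)
env = rng.random(5000)                                    # toy speech envelope
eeg = np.roll(env, 12) + 0.5 * rng.standard_normal(5000)  # cortex tracks it at a lag
w = trf_forward(env, eeg, lags=range(40))
print(w.argmax())                                         # peak weight near lag 12
```

The paper's latency comparison amounts to asking whether this peak sits at an earlier lag for audiovisual than for audio-only speech; the reverse mapping swaps the roles of stimulus and response.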
Woynaroski, Tiffany G.; Kwakye, Leslie D.; Foss-Feig, Jennifer H.; Stevenson, Ryan A.; Stone, Wendy L.; Wallace, Mark T.
This study examined unisensory and multisensory speech perception in 8-17 year old children with autism spectrum disorders (ASD) and typically developing controls matched on chronological age, sex, and IQ. Consonant-vowel syllables were presented in visual only, auditory only, matched audiovisual, and mismatched audiovisual ("McGurk")…
Jeanne A Guiraud
The language difficulties often seen in individuals with autism might stem from an inability to integrate audiovisual information, a skill important for language development. We investigated whether 9-month-old siblings of older children with autism, who are at an increased risk of developing autism, are able to integrate audiovisual speech cues. We used an eye-tracker to record where infants looked when shown a screen displaying two faces of the same model, where one face is articulating /ba/ and the other /ga/, with one face congruent with the syllable sound being presented simultaneously, the other face incongruent. This method was successful in showing that infants at low risk can integrate audiovisual speech: they looked for the same amount of time at the mouths in both the fusible visual /ga/–audio /ba/ and the congruent visual /ba/–audio /ba/ displays, indicating that the auditory and visual streams fuse into a McGurk-type of syllabic percept in the incongruent condition. It also showed that low-risk infants could perceive a mismatch between auditory and visual cues: they looked longer at the mouth in the mismatched, non-fusible visual /ba/–audio /ga/ display compared with the congruent visual /ga/–audio /ga/ display, demonstrating that they perceive an uncommon, and therefore interesting, speech-like percept when looking at the incongruent mouth (repeated ANOVA: displays × fusion/mismatch conditions interaction: F(1,16) = 17.153, p = 0.001). The looking behaviour of high-risk infants did not differ according to the type of display, suggesting difficulties in matching auditory and visual information (repeated ANOVA, displays × conditions interaction: F(1,25) = 0.09, p = 0.767), in contrast to low-risk infants (repeated ANOVA: displays × conditions × low/high-risk groups interaction: F(1,41) = 4.466, p = 0.041). In some cases this reduced ability might lead to the poor communication skills characteristic of autism.
Holt, Lori L.; Lotto, Andrew J.
Speech perception (SP) most commonly refers to the perceptual mapping from the highly variable acoustic speech signal to a linguistic representation, whether it be phonemes, diphones, syllables, or words. This is an example of categorization, in that potentially discriminable speech sounds are assigned to functionally equivalent classes. In this tutorial, we present some of the main challenges to our understanding of the categorization of speech sounds and the conceptualization of SP that has...
Mroueh, Youssef; Marcheret, Etienne; Goel, Vaibhava
In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of 41% under clean condition on the IBM large vocabulary audio-visual studio datase...
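The first fusion approach described, concatenating the final hidden layers of separately trained unimodal networks and building another network on top, can be sketched as follows, with random weights standing in for the trained models and all dimensions hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

Wa = 0.1 * rng.standard_normal((40, 64))   # audio features  -> audio hidden layer
Wv = 0.1 * rng.standard_normal((30, 64))   # visual features -> visual hidden layer
Wf = 0.1 * rng.standard_normal((128, 42))  # joint space     -> phone logits

audio = rng.standard_normal(40)            # one frame of acoustic features
video = rng.standard_normal(30)            # one frame of visual (lip) features

joint = np.concatenate([relu(audio @ Wa), relu(video @ Wv)])  # fuse hidden layers
logits = joint @ Wf
phone_post = np.exp(logits) / np.exp(logits).sum()            # softmax over phones
```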
Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Typically, the visual mode of the synthetic speech is synthesized separately from the audio, the latter being either natural or synthesized speech. However, the perception of mismatches between these two information streams requires experimental exploration since it could degrade the quality of the output. In order to increase the intermodal coherence in synthetic 2D photorealistic speech, we extended the well-known unit selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video. Subjective experiments confirm that the audiovisual signals created by our multimodal synthesis strategy are indeed perceived as being more synchronous than those of systems in which both modes are not intrinsically coherent. Furthermore, it is shown that the degree of coherence between the auditory mode and the visual mode has an influence on the perceived quality of the synthetic visual speech fragment. In addition, the audio quality was found to have only a minor influence on the perceived visual signal's quality.
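The multimodal unit-selection extension keeps the classic search: choose, per target, the candidate audiovisual segment minimizing a target cost plus a join cost with its neighbor. A toy sketch with scalar features and hypothetical costs (real systems use spectral and visual join distances):

```python
import numpy as np

def select_units(targets, feats, join):
    """Viterbi search over candidate units: target cost |feat - target|
    plus a pairwise join cost between consecutive selected units."""
    T, U = len(targets), len(feats)
    cost = np.abs(feats[None, :] - targets[:, None])   # T x U target costs
    best, back = cost[0].copy(), np.zeros((T, U), int)
    for t in range(1, T):
        total = best[:, None] + join + cost[t][None, :]
        back[t], best = total.argmin(0), total.min(0)
    path = [int(best.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                                  # unit index per target

rng = np.random.default_rng(0)
targets = np.array([0.1, 0.5, 0.9])                    # desired AV unit features
feats = np.array([0.0, 0.3, 0.6, 1.0])                 # candidate unit features
print(select_units(targets, feats, 0.2 * rng.random((4, 4))))
```

Keeping audio and video inside one segment, as the abstract describes, simply means the candidate inventory holds joint AV units so coherence never has to be re-imposed after synthesis.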
Lebib, Riadh; Papo, David; Douiri, Abdel; de Bode, Stella; Gillon Dowens, Margaret; Baudonnière, Pierre-Marie
Lipreading reliably improves speech perception during face-to-face conversation. Within the range of good dubbing, however, adults tolerate some audiovisual (AV) discrepancies, and lipreading can then give rise to confusion. We used event-related brain potentials (ERPs) to study the perceptual strategies governing the intermodal processing of dynamic and bimodal speech stimuli, either congruently dubbed or not. Electrophysiological analyses revealed that non-coherent audiovisual dubbings modulated in amplitude an endogenous ERP component, the N300, which we compared to an 'N400-like effect' reflecting the difficulty of integrating these conflicting pieces of information. This result adds further support for the existence of a cerebral system underlying 'integrative processes' lato sensu. Further studies should take advantage of this 'N400-like effect' with AV speech stimuli to open new perspectives in the domain of psycholinguistics. PMID:15531091
An increasing number of neuroscience papers capitalize on the assumption published in this journal that visual speech would be typically 150 ms ahead of auditory speech. It happens that the estimation of audiovisual asynchrony in the reference paper is valid only in very specific cases, for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call "preparatory gestures". However, when syllables are chained in sequences, as they are typically in most parts of a natural speech utterance, asynchrony should be defined in a different way. This is what we call "comodulatory gestures" providing auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na) showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
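The closing claim, that visual lead is not necessary for visual prediction, can be illustrated with a toy simulation in the same spirit, though the authors' actual model is not reproduced here: a synchronous but less noisy visual correlate of the shared envelope already improves prediction of upcoming audio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
env = np.convolve(rng.standard_normal(n), np.ones(50) / 50, 'same')  # slow envelope
audio = env + 0.8 * rng.standard_normal(n)   # noisy auditory evidence
video = env + 0.3 * rng.standard_normal(n)   # cleaner, strictly synchronous visual cue

def pred_err(feats, target, lag=10):
    """Mean squared error when predicting target[t] from feats[t - lag]."""
    X = np.stack([np.roll(f, lag) for f in feats], 1)[lag:]
    y = target[lag:]
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.mean((y - X @ w) ** 2)

print(pred_err([audio], audio))          # audio-alone prediction error
print(pred_err([audio, video], audio))   # adding synchronous video reduces it
```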
Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency by which others' emotions are recognized. But how and when exactly do the different modalities interact? One aspect of multisensory perception that has received increasing interest in recent years is the concept of crossmodal prediction. In emotion perception, as in most other settings, visual information precedes the auditory one. This visual lead can facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, it has not been addressed so far in audiovisual emotion perception. Based on the current state of the art in (a) crossmodal prediction and (b) multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG) and magnetoencephalographic (MEG) studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow for a more reliable prediction of auditory information compared to non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 response in the EEG and the duration of visual emotional but not non-emotional information. If the assumption that emotional content allows for more reliable predictions can be corroborated in future studies, crossmodal prediction is a crucial factor in our understanding of multisensory emotion perception.
Lee, Hwee Ling; Noppeney, Uta
This psychophysics study used musicians as a model to investigate whether musical expertise shapes the temporal integration window for audiovisual speech, sinewave speech or music. Musicians and non-musicians judged the audiovisual synchrony of speech, sinewave analogues of speech, and music stimuli at 13 audiovisual stimulus onset asynchronies (±360, ±300, ±240, ±180, ±120, ±60, and 0 ms). Further, we manipulated the duration of the stimuli by presenting sentences/melodies or syllables/tones. Critically, musicians relative to non-musicians exhibited significantly narrower temporal integration windows for both music and sinewave speech. Further, the temporal integration window for music decreased with the amount of music practice, but not with age of acquisition. In other words, the more musicians practiced piano in the past three years, the more sensitive they became to the temporal misalignment of visual and auditory signals. Collectively, our findings demonstrate that music practice fine-tunes the audiovisual temporal integration window to various extents depending on the stimulus class. While the effect of piano practice was most pronounced for music, it also generalized to other stimulus classes such as sinewave speech and, to a marginally significant degree, to natural speech.
The general aim of this thesis was to test the effects of paralinguistic (emotional) and prior contextual (topical) cues on perception of poorly specified visual, auditory, and audiovisual speech. The specific purposes were to (1) examine if facially displayed emotions can facilitate speechreading performance; (2) to study the mechanism for such facilitation; (3) to map information-processing factors that are involved in processing of poorly specified speech; and (4) to present a comprehensiv...
We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, the articulatory motion data is applied to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated in an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application to illustrate its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow.
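One concrete piece of such a pipeline is retargeting an EMA coil trajectory onto a skeletal channel, e.g. deriving a jaw-rotation keyframe per animation frame. A minimal sketch under assumed coordinates (coil positions in mm relative to an upper-incisor reference; all names and rates hypothetical):

```python
import numpy as np

def jaw_rotation_keyframes(chin_coil, fps_in=200, fps_out=25):
    """Convert a chin-coil trajectory (N x 2, mm, relative to the upper
    incisor) into jaw-rotation angles resampled to the animation rate."""
    angles = np.degrees(np.arctan2(chin_coil[:, 1], chin_coil[:, 0]))
    t_in = np.arange(len(angles)) / fps_in
    t_out = np.arange(0, t_in[-1], 1 / fps_out)
    return np.interp(t_out, t_in, angles)    # one keyframe per output frame

# Hypothetical EMA frames: the chin lowers over time (jaw opening).
chin = np.column_stack([np.full(400, 60.0), np.linspace(-20, -35, 400)])
keys = jaw_rotation_keyframes(chin)
```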
Events encoded in separate sensory modalities, such as audition and vision, can seem to be synchronous across a relatively broad range of physical timing differences. This may suggest that the precision of audio-visual timing judgments is inherently poor. Here we show that this is not necessarily true. We contrast timing sensitivity for isolated streams of audio and visual speech, and for streams of audio and visual speech accompanied by additional, temporally offset, visual speech streams. We find that the precision with which synchronous streams of audio and visual speech are identified is enhanced by the presence of additional streams of asynchronous visual speech. Our data suggest that timing perception is shaped by selective grouping processes, which can result in enhanced precision in temporally cluttered environments. The imprecision suggested by previous studies might therefore be a consequence of examining isolated pairs of audio and visual events. We argue that when an isolated pair of cross-modal events is presented, they tend to group perceptually and to seem synchronous as a consequence. We have revealed greater precision by providing multiple visual signals, possibly allowing a single auditory speech stream to group selectively with the most synchronous visual candidate. The grouping processes we have identified might be important in daily life, such as when we attempt to follow a conversation in a crowded room.
Evitts, Paul M.; Portugal, Lindsay; Van Dine, Ami; Holler, Aline
Background: There is minimal research on the contribution of visual information on speech intelligibility for individuals with a laryngectomy (IWL). Aims: The purpose of this project was to determine the effects of mode of presentation (audio-only, audio-visual) on alaryngeal speech intelligibility. Method: Twenty-three naive listeners were…
Kubicek, Claudia; Hillairet de Boisferon, Anne; Dupierrix, Eve; Pascalis, Olivier; Lœvenbruck, Hélène; Gervain, Judit; Schwarzer, Gudrun
The present study examined when and how the ability to cross-modally match audio-visual fluent speech develops in 4.5-, 6- and 12-month-old German-learning infants. In Experiment 1, 4.5- and 6-month-old infants' audio-visual matching ability of native (German) and non-native (French) fluent speech was assessed by presenting auditory and visual speech information sequentially, that is, in the absence of temporal synchrony cues. The results showed that 4.5-month-old infants were capable of matc...
Choudhury, N.; Amer, I; Daniels, M; Wareing, MJ
Introduction Aural microsuction is a common ear, nose and throat procedure used in the outpatient setting. Some patients, however, find it difficult to tolerate owing to discomfort, pain or noise. This study evaluated the effect of audiovisual distraction on patients’ pain perception and overall satisfaction. Methods A prospective study was conducted for patients attending our aural care clinic requiring aural toileting of bilateral mastoid cavities over a three-month period. All microsuction...
Andersen, Tobias; Starrfelt, Randi
Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speech perception. While these studies have focused on auditory speech perception, other studies have shown that Broca's area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca's aphasia did not experience the McGurk illusion, suggesting that an intact Broca...
Vouloumanos, Athena; Gelfand, Hanna M.
The ability to decode atypical and degraded speech signals as intelligible is a hallmark of speech perception. Human adults can perceive sounds as speech even when they are generated by a variety of nonhuman sources including computers and parrots. We examined how infants perceive the speech-like vocalizations of a parrot. Further, we examined how…
Maganti, Hari Krishna; Gatica-Perez, Daniel; McCowan, Iain A.
We address the problem of distant speech acquisition in multi-party meetings, using multiple microphones and cameras. Microphone array beamforming techniques present a potential alternative to close-talking microphones by providing speech enhancement through spatial filtering and directional discrimination. Beamforming techniques rely on the knowledge of a speaker location. In this paper, we present an integrated approach, in which an audio-visual multi-person tracker is used to track active ...
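A minimal sketch of the core idea above, assuming a delay-and-sum beamformer steered at a speaker position supplied by the audio-visual tracker; the array geometry, sampling rate, and integer-sample alignment are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def delay_and_sum(signals: np.ndarray, mic_xy: np.ndarray,
                  src_xy: np.ndarray, fs: float, c: float = 343.0):
    """signals: (n_mics, n_samples). Align channels on the source and sum."""
    dists = np.linalg.norm(mic_xy - src_xy, axis=1)
    delays = (dists - dists.min()) / c          # relative delays in seconds
    shifts = np.round(delays * fs).astype(int)  # integer-sample approximation
    n = signals.shape[1] - shifts.max()
    aligned = np.stack([sig[s:s + n] for sig, s in zip(signals, shifts)])
    return aligned.mean(axis=0)                 # spatially filtered output

fs = 16000.0
mic_xy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0]])
src_xy = np.array([1.0, 2.0])                  # from the audio-visual tracker
signals = np.random.default_rng(3).normal(size=(4, 1600))
enhanced = delay_and_sum(signals, mic_xy, src_xy, fs)
```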
Adank, Patti; Nuttall, Helen E.; Banks, Briony; Kennedy-Higgins, Daniel
The recognition of unfamiliar regional and foreign accents represents a challenging task for the speech perception system (Floccia et al., 2006; Adank et al., 2009). Despite the frequency with which we encounter such accents, the neural mechanisms supporting successful perception of accented speech are poorly understood. Nonetheless, candidate neural substrates involved in processing speech in challenging listening conditions, including accented speech, are beginning to be identified. This re...
Attigodu, Ganesh; Berthommier, Frédéric; Nahorna, Olha; Schwartz, Jean-Luc
In a previous set of experiments we showed that audio-visual fusion during the McGurk effect may be modulated by context. A short context (2 to 4 syllables) composed of incoherent auditory and visual material significantly decreases the McGurk effect. We interpreted this as showing the existence of an audiovisual "binding" stage controlling the fusion process, and we also showed the existence of a "rebinding" process when an incoherent material is followed by a short coherent material. In thi...
Lidestam, Björn; Moradi, Shahram; Pettersson, Rasmus; Ricklefs, Theodor
The effects of audiovisual versus auditory training for speech-in-noise identification were examined in 60 young participants. The training conditions were audiovisual training, auditory-only training, and no training (n = 20 each). In the training groups, gated consonants and words were presented at 0 dB signal-to-noise ratio; stimuli were either audiovisual or auditory-only. The no-training group watched a movie clip without performing a speech identification task. Speech-in-noise identific...
Aleksic, Petar S.
We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs) supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. Principal component analysis (PCA) was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR) experiments. Both single-stream and multistream hidden Markov models (HMMs) were used to model the ASR system, integrate audio and visual information, and perform relatively large vocabulary (approximately 1000 words) speech recognition experiments. The experiments were performed using clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER) by 20% to 23% relative to audio-only speech recognition WERs at various SNRs (0-30 dB) with additive white Gaussian noise, and by 19% relative to the audio-only speech recognition WER under clean audio conditions.
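Multistream-HMM integration of the kind mentioned above is commonly realized by raising each stream's observation likelihood to a stream exponent; the sketch below shows that combination rule with assumed, illustrative scores and exponents (the paper's exact parameterization is not reproduced here).

```python
import numpy as np

# Per-state observation scores from the audio and visual (PCA-projected FAP)
# streams are combined with stream exponents gamma_a and gamma_v.
def multistream_log_score(log_b_audio: float, log_b_video: float,
                          gamma_audio: float, gamma_video: float) -> float:
    """log b(o) = gamma_a * log b_a(o_a) + gamma_v * log b_v(o_v)."""
    return gamma_audio * log_b_audio + gamma_video * log_b_video

# At low SNR the visual exponent would be raised relative to the audio one.
print(multistream_log_score(np.log(0.02), np.log(0.30), 0.3, 0.7))
```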
Bruderer, Alison G; Danielson, D Kyle; Kandhadai, Padmapriya; Werker, Janet F
The influence of speech production on speech perception is well established in adults. However, because adults have a long history of both perceiving and producing speech, the extent to which the perception-production linkage is due to experience is unknown. We addressed this issue by asking whether articulatory configurations can influence infants' speech perception performance. To eliminate influences from specific linguistic experience, we studied preverbal, 6-mo-old infants and tested the discrimination of a nonnative, and hence never-before-experienced, speech sound distinction. In three experimental studies, we used teething toys to control the position and movement of the tongue tip while the infants listened to the speech sounds. Using ultrasound imaging technology, we verified that the teething toys consistently and effectively constrained the movement and positioning of infants' tongues. With a looking-time procedure, we found that temporarily restraining infants' articulators impeded their discrimination of a nonnative consonant contrast but only when the relevant articulator was selectively restrained to prevent the movements associated with producing those sounds. Our results provide striking evidence that even before infants speak their first words and without specific listening experience, sensorimotor information from the articulators influences speech perception. These results transform theories of speech perception by suggesting that even at the initial stages of development, oral-motor movements influence speech sound discrimination. Moreover, an experimentally induced "impairment" in articulator movement can compromise speech perception performance, raising the question of whether long-term oral-motor impairments may impact perceptual development. PMID:26460030
Lim, Jung-Hui; Oh, Do-Kwan; Lee, Soo-Young
In noisy environments human speech perception utilizes visual lip-reading as well as audio phonetic classification. This audio-visual integration may be done by combining the two sensory features at an early stage. Top-down attention may also integrate the two modalities. For the sensory feature fusion we introduce mapping functions between the audio and visual manifolds. In particular, we present an algorithm that provides a one-to-many mapping function for the video-to-audio mapping. Top-down attention is also presented, integrating both the sensory features and the classification results of both modalities, which is able to explain the McGurk effect. Each classifier is separately implemented by a Hidden Markov Model (HMM), but the two classifiers are combined at the top level and interact through the top-down attention.
Tong Ming; Bian Zhengzhong; Li Xiaohui; Dai Qijun; Chen Yanpu
The perceptual effect of phase information in speech has been studied by auditory subjective tests. On the condition that the phase spectrum of speech is changed while the amplitude spectrum is unchanged, the tests show that: (1) if the envelope of the reconstructed speech signal is unchanged, there is no distinctive difference in auditory perception between the original speech and the reconstructed speech; (2) the auditory perceptual effect of the reconstructed speech mainly lies in the amplitude of the derivative of the additive phase; (3) with td denoting the maximum relative time shift between different frequency components of the reconstructed speech signal, speech quality is excellent while td < 10 ms; good while 10 ms < td < 20 ms; common while 20 ms < td < 35 ms; and poor while td > 35 ms.
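The reported quality categories amount to a simple threshold rule on td; the helper below is purely illustrative and only restates the abstract's boundaries (the function name and interface are our own).

```python
def speech_quality_from_td(td_ms: float) -> str:
    """Map the maximum relative time shift td (in ms) between frequency
    components of phase-modified speech to the quality category reported
    above. Thresholds are taken from the abstract; boundary handling at
    exactly 10/20/35 ms is our assumption."""
    if td_ms < 10:
        return "excellent"
    elif td_ms < 20:
        return "good"
    elif td_ms < 35:
        return "common"
    else:
        return "poor"

print(speech_quality_from_td(15))  # -> "good"
```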
Ross, Lars A; Molholm, Sophie; Blanco, Daniella; Gomez-Ramirez, Manuel; Saint-Amour, Dave; Foxe, John J
Observing a speaker's articulations substantially improves the intelligibility of spoken speech, especially under noisy listening conditions. This multisensory integration of speech inputs is crucial to effective communication. Appropriate development of this ability has major implications for children in classroom and social settings, and deficits in it have been linked to a number of neurodevelopmental disorders, especially autism. It is clear from structural imaging studies that there is a prolonged maturational course within regions of the perisylvian cortex that persists into late childhood, and these regions have been firmly established as being crucial to speech and language functions. Given this protracted maturational timeframe, we reasoned that multisensory speech processing might well show a similarly protracted developmental course. Previous work in adults has shown that audiovisual enhancement in word recognition is most apparent within a restricted range of signal-to-noise ratios (SNRs). Here, we investigated when these properties emerge during childhood by testing multisensory speech recognition abilities in typically developing children aged between 5 and 14 years, and comparing them with those of adults. By parametrically varying SNRs, we found that children benefited significantly less from observing visual articulations, displaying considerably less audiovisual enhancement. The findings suggest that improvement in the ability to recognize speech-in-noise and in audiovisual integration during speech perception continues quite late into the childhood years. The implication is that a considerable amount of multisensory learning remains to be achieved during the later schooling years, and that explicit efforts to accommodate this learning may well be warranted. PMID:21615556
OBJECTIVE: To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. METHODS: Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair-Schulz-Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0-500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. RESULTS: Higher frame rate (>7 fps), higher camera resolution (>640 × 480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There was a significant median gain of +8.5 percentage points (p = 0.009) in speech perception for all 21 CI users if visual cues were additionally shown. CI users with poor open-set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8 percentage points, p = 0.032). CONCLUSION: Webcameras have the potential to improve telecommunication of hearing-impaired individuals.
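The reported thresholds suggest a simple rule of thumb for telecommunication setups; the check below merely restates them in code (the function, and the strictness of the inequalities, are our assumptions, not the study's).

```python
# Hypothetical helper based on the thresholds reported above: a video-call
# configuration is likely to support speechreading if frame rate > 7 fps,
# resolution above 640 x 480 px, and audio/video delay under 100 ms.
def supports_speechreading(width: int, height: int,
                           fps: float, delay_ms: float) -> bool:
    return (width * height > 640 * 480
            and fps > 7
            and delay_ms < 100)

print(supports_speechreading(1280, 720, 30, 50))   # True
print(supports_speechreading(320, 240, 5, 250))    # False
```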
Jesse, A.; Janse, E.
Older listeners are more affected than younger listeners in their recognition of speech in adverse conditions, such as when they also hear a single-competing speaker. In the present study, we investigated with a speeded response task whether older listeners with various degrees of hearing loss benefit under such conditions from also seeing the speaker they intend to listen to. We also tested, at the same time, whether older adults need postperceptual processing to obtain an audiovisual benefi...
The paper provides an analytical review covering the latest achievements in the field of audio-visual (AV) fusion (integration) of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give a classification of the audio and visual features of speech. Special attention is paid to the systematization of existing techniques and AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use AV fusion, based on our analysis of the research area. We also indicate the methods, techniques, and audio and video features used. We propose a classification of AV integration, and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future of the field of AV fusion. In further research we plan to implement a system of audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.
Pisoni, David B.
The goal was to acquire new knowledge about speech perception and production in severe environments such as high masking noise, increased cognitive load or sustained attentional demands. Changes were examined in speech production under these adverse conditions through acoustic analysis techniques. One set of studies focused on the effects of noise on speech production. The experiments in this group were designed to generate a database of speech obtained in noise and in quiet. A second set of experiments was designed to examine the effects of cognitive load on the acoustic-phonetic properties of speech. Talkers were required to carry out a demanding perceptual motor task while they read lists of test words. A final set of experiments explored the effects of vocal fatigue on the acoustic-phonetic properties of speech. Both cognitive load and vocal fatigue are present in many applications where speech recognition technology is used, yet their influence on speech production is poorly understood.
Wang, DeLiang; Kjems, Ulrik; Pedersen, Michael Syskind;
For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by comparing speech energy and noise energy within local time-frequency units. It is observed that listeners achieve nearly perfect speech recognition from gated noise with binary gains prescribed by the ideal binary mask. Only 16 filter channels and a frame rate of 100 Hz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception.
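A minimal sketch of the mask construction described above, assuming access to the premixed speech and noise spectrograms; the function name and the 0 dB local-criterion default are illustrative, not the authors' code.

```python
import numpy as np

def ideal_binary_mask(speech_tf: np.ndarray, noise_tf: np.ndarray,
                      lc_db: float = 0.0) -> np.ndarray:
    """Ideal binary mask: keep a time-frequency unit when the local
    speech-to-noise ratio exceeds the local criterion lc_db.
    speech_tf and noise_tf are magnitude spectrograms (freq x time)
    of the premixed speech and noise; shapes must match."""
    eps = 1e-12
    local_snr_db = 10.0 * np.log10((speech_tf ** 2 + eps) /
                                   (noise_tf ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)

# Toy example: 16 channels x 100 frames, echoing the minimal setup above.
rng = np.random.default_rng(0)
speech = rng.random((16, 100))
noise = rng.random((16, 100))
mask = ideal_binary_mask(speech, noise)
masked_mixture = mask * (speech + noise)  # gate the mixture by the mask
```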
Clement, Bart Richard
Although speech perception has been considered a predominantly auditory phenomenon, large benefits from vision in degraded acoustic conditions suggest integration of audition and vision. More direct evidence of this comes from studies of audiovisual disparity that demonstrate that vision can bias and even dominate perception (McGurk & MacDonald, 1976). It has been observed that hearing-impaired listeners demonstrate more visual biasing than normally hearing listeners (Walden et al., 1990). It is argued here that stimulus audibility must be equated across groups before true differences can be established. In the present investigation, effects of visual biasing on perception were examined as audibility was degraded for 12 young normally hearing listeners. Biasing was determined by quantifying the degree to which listener identification functions for a single synthetic auditory /ba-da-ga/ continuum changed across two conditions: (1) an auditory-only listening condition; and (2) an auditory-visual condition in which every item of the continuum was synchronized with visual articulations of the consonant-vowel (CV) tokens /ba/ and /ga/, as spoken by each of two talkers. Audibility was altered by presenting the conditions in quiet and in noise at each of three signal-to-noise (S/N) ratios. For the visual-/ba/ context, large effects of audibility were found. As audibility decreased, visual biasing increased. A large talker effect also was found, with one talker eliciting more biasing than the other. An independent lipreading measure demonstrated that this talker was more visually intelligible than the other. For the visual-/ga/ context, audibility and talker effects were less robust, possibly obscured by strong listener effects, which were characterized by marked differences in perceptual processing patterns among participants. Some demonstrated substantial biasing whereas others demonstrated little, indicating a strong reliance on audition even in severely degraded acoustic
Lynne E Bernstein
This paper examines the questions, what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody) can be perceived visually, although individuals differ in their abilities to do so; and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically, face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread, diverse patterns of activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1) The visual perception of speech relies on visual pathway representations of speech qua speech. (2) A proposed site of these representations, the temporal visual speech area (TVSA), has been demonstrated in posterior temporal cortex, ventral and posterior to the multisensory posterior superior temporal sulcus (pSTS). (3) Given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA.
Howard Charles Nusbaum
One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming relatively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process, which implies rigidity of processing with few demands on cognitive processing. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided, but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing, such as masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions that include descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine cognitive resources recruited during perception, including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, this may provide new insights into ways in which hearing disorders and loss may be treated either through augmentation or
Little is known about the perception of speech sounds by native Danish listeners. However, the Danish sound system differs in several interesting ways from the sound systems of other languages. For instance, Danish is characterized, among other features, by a rich vowel inventory and by different reductions of speech sounds evident in the pronunciation of the language. This book (originally a PhD thesis) consists of three studies based on the results of two experiments. The experiments were designed to provide knowledge of the perception of Danish speech sounds by Danish adults and infants, in the light of the rich and complex Danish sound system. The first two studies report on native adults' perception of Danish speech sounds in quiet and noise. The third study examined the development of language-specific perception in native Danish infants at 6, 9 and 12 months of age. The book points...
Homae, Fumitaka; Watanabe, Hama; Taga, Gentaro
Infants often pay special attention to speech sounds, and they appear to detect key features of these sounds. To investigate the neural foundation of speech perception in infants, we measured cortical activation using near-infrared spectroscopy. We presented the following three types of auditory stimuli while 3-month-old infants watched a silent…
Lotto, Andrew J.; Hickok, Gregory S.; Holt, Lori L.
The discovery of mirror neurons, a class of neurons that respond when a monkey performs an action and also when the monkey observes others producing the same action, has promoted a renaissance for the Motor Theory (MT) of speech perception. This is because mirror neurons seem to accomplish the same kind of one to one mapping between perception and action that MT theorizes to be the basis of human speech communication. However, this seeming correspondence is superficial, and there are theoreti...
Lusk, Laina G; Mitchel, Aaron D
Speech is inextricably multisensory: both auditory and visual components provide critical information for all aspects of speech processing, including speech segmentation, the visual components of which have been the target of a growing number of studies. In particular, a recent study (Mitchel and Weiss, 2014) established that adults can utilize facial cues (i.e., visual prosody) to identify word boundaries in fluent speech. The current study expanded upon these results, using an eye tracker to identify highly attended facial features of the audiovisual display used in Mitchel and Weiss (2014). Subjects spent the most time watching the eyes and mouth. A significant trend in gaze durations was found with the longest gaze duration on the mouth, followed by the eyes and then the nose. In addition, eye-gaze patterns changed across familiarization as subjects learned the word boundaries, showing decreased attention to the mouth in later blocks while attention on other facial features remained consistent. These findings highlight the importance of the visual component of speech processing and suggest that the mouth may play a critical role in visual speech segmentation. PMID:26869959
van Hoesel, Richard J. M.
One of the key benefits of using cochlear implants (CIs) in both ears rather than just one is improved localization. It is likely that in complex listening scenes, improved localization allows bilateral CI users to orient toward talkers to improve signal-to-noise ratios and gain access to visual cues, but to date, that conjecture has not been tested. To obtain an objective measure of that benefit, seven bilateral CI users were assessed for both auditory-only and audio-visual speech intelligib...
It has been shown that integration of acoustic and visual information, especially in noisy conditions, yields improved speech recognition results. This raises the question of how to weight the two modalities in different noise conditions. Throughout this paper we develop a weighting process adaptive to various background noise situations. In the presented recognition system, audio and video data are combined following a Separate Integration (SI) architecture. A hybrid Artificial Neural Network/Hidden Markov Model (ANN/HMM) system is used for the experiments. The neural networks were in all cases trained on clean data. Firstly, we evaluate the performance of different weighting schemes in a manually controlled recognition task with different types of noise. Next, we compare different criteria to estimate the reliability of the audio stream. Based on this, a mapping between the measurements and the free parameter of the fusion process is derived and its applicability is demonstrated. Finally, the possibilities and limitations of adaptive weighting are compared and discussed.
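A sketch of the separate-integration idea under assumed posteriors: each stream is scored independently and the scores are combined with a reliability weight for the audio stream. The SNR-to-weight mapping here is a hypothetical stand-in for the empirically derived mapping in the paper.

```python
import numpy as np

def fuse_separate_integration(log_p_audio: np.ndarray,
                              log_p_video: np.ndarray,
                              lambda_a: float) -> int:
    """Return the class index maximizing the weighted log-score."""
    fused = lambda_a * log_p_audio + (1.0 - lambda_a) * log_p_video
    return int(np.argmax(fused))

def audio_weight_from_snr(snr_db: float) -> float:
    """Hypothetical monotone mapping: trust audio more as SNR improves."""
    return float(np.clip((snr_db + 5.0) / 25.0, 0.0, 1.0))

log_p_audio = np.log(np.array([0.6, 0.3, 0.1]))  # audio classifier posteriors
log_p_video = np.log(np.array([0.2, 0.5, 0.3]))  # video classifier posteriors
print(fuse_separate_integration(log_p_audio, log_p_video,
                                audio_weight_from_snr(-5.0)))  # video dominates
```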
This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images, as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected-digit speech contaminated with white noise at various SNR conditions show the effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
Rhone, Ariane E.; Nourski, Kirill V.; Oya, Hiroyuki; Kawasaki, Hiroto; Howard, Matthew A.; McMurray, Bob
In everyday conversation, viewing a talker's face can provide information about the timing and content of an upcoming speech signal, resulting in improved intelligibility. Using electrocorticography, we tested whether human auditory cortex in Heschl's gyrus (HG) and on superior temporal gyrus (STG) and motor cortex on precentral gyrus (PreC) were responsive to visual/gestural information prior to the onset of sound and whether early stages of auditory processing were sensitive to the visual content (speech syllable versus non-speech motion). Event-related band power (ERBP) in the high gamma band was content-specific prior to acoustic onset on STG and PreC, and ERBP in the beta band differed in all three areas. Following sound onset, we found no evidence for content-specificity in HG, evidence for visual specificity in PreC, and specificity for both modalities in STG. These results support models of audio-visual processing in which sensory information is integrated in non-primary cortical areas. PMID:27182530
Gordon, Peter C.
Four experiments addressing the role of attention in phonetic perception are reported. The first experiment shows that the relative importance of two cues to the voicing distinction changes when subjects must perform an arithmetic distractor task at the same time as identifying a speech stimulus. The voice onset time cue loses phonetic significance when subjects are distracted, while the F0 onset frequency cue does not. The second experiment shows a similar pattern for two cues to the distinction between the vowels /i/ (as in 'beat') and /I/ (as in 'bit'). Together these experiments indicate that careful attention to speech perception is necessary for strong acoustic cues to achieve their full phonetic impact, while weaker acoustic cues achieve their full phonetic impact without close attention. Experiment 3 shows that this pattern is obtained when the distractor task places little demand on verbal short term memory. Experiment 4 provides a large data set for testing formal models of the role of attention in speech perception. Attention is shown to influence the signal to noise ratio in phonetic encoding. This principle is instantiated in a network model in which the role of attention is to reduce noise in the phonetic encoding of acoustic cues. Implications of this work for understanding speech perception and general theories of the role of attention in perception are discussed.
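A toy rendering of that final principle, assuming a linear combination of two cues with additive encoding noise that shrinks as attention increases; the weights, cue values, and decision rule below are ours, not the paper's network model.

```python
import numpy as np

rng = np.random.default_rng(1)

def phonetic_decision(vot: float, f0: float, attention: float,
                      n_trials: int = 10000) -> float:
    """P('voiceless') from two cues; attention in (0, 1] scales down the
    noise added during phonetic encoding, so divided attention makes the
    categorization less determined by the cue values."""
    w_vot, w_f0 = 2.0, 0.5  # strong vs weak cue weights (illustrative)
    noise = rng.normal(0.0, 1.0 / attention, size=(n_trials, 2))
    z = w_vot * (vot + noise[:, 0]) + w_f0 * (f0 + noise[:, 1])
    return float(np.mean(z > 0))

print(phonetic_decision(0.5, 0.1, attention=1.0))   # focused listening
print(phonetic_decision(0.5, 0.1, attention=0.25))  # distracted listening
```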
Alan James Power
Auditory cortical oscillations have been proposed to play an important role in speech perception. It is suggested that the brain may take temporal ‘samples’ of information from the speech stream at different rates, phase-resetting ongoing oscillations so that they are aligned with similar frequency bands in the input (‘phase locking’). Information from these frequency bands is then bound together for speech perception. To date, there are no explorations of neural phase-locking and entrainment to speech input in children. However, it is clear from studies of language acquisition that infants use both visual speech information and auditory speech information in learning. In order to study neural entrainment to speech in typically-developing children, we use a rhythmic entrainment paradigm (underlying 2 Hz or delta rate) based on repetition of the syllable "ba", presented in either the auditory modality alone, the visual modality alone, or as auditory-visual speech (via a talking head). To ensure attention to the task, children aged 13 years were asked to press a button as fast as possible when the "ba" stimulus violated the rhythm for each stream type. Rhythmic violation depended on delaying the occurrence of a "ba" in the isochronous stream. Neural entrainment was demonstrated for all stream types, and individual differences in standardized measures of language processing were related to auditory entrainment at the theta rate. Further, there was significant modulation of the preferred phase of auditory entrainment in the theta band when visual speech cues were present, indicating cross-modal phase resetting. The rhythmic entrainment paradigm developed here offers a method for exploring individual differences in oscillatory phase locking during development. In particular, a method for assessing neural entrainment and cross-modal phase resetting would be useful for exploring developmental learning difficulties thought to involve temporal sampling
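Entrainment in such paradigms is often quantified as inter-trial phase consistency at the stimulation rate; the sketch below computes a phase-locking value at 2 Hz on synthetic epochs. The measure and all names are illustrative, not the study's exact analysis pipeline.

```python
import numpy as np

def phase_locking_at(freq_hz: float, epochs: np.ndarray, fs: float) -> float:
    """epochs: (n_trials, n_samples). Returns phase-locking value in [0, 1]."""
    n = epochs.shape[1]
    t = np.arange(n) / fs
    # Project each trial onto a complex exponential at the target frequency.
    phasor = np.exp(-2j * np.pi * freq_hz * t)
    trial_phases = np.angle(epochs @ phasor)
    return float(np.abs(np.mean(np.exp(1j * trial_phases))))

fs = 250.0
t = np.arange(int(2 * fs)) / fs
rng = np.random.default_rng(2)
# Synthetic "entrained" data: a 2 Hz component with consistent phase + noise.
epochs = np.sin(2 * np.pi * 2.0 * t) + rng.normal(0, 1.0, (40, t.size))
print(phase_locking_at(2.0, epochs, fs))  # high value => strong entrainment
```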
Fava, Eswen; Hull, Rachel; Bortfeld, Heather
Initially, infants are capable of discriminating phonetic contrasts across the world's languages. Starting between seven and ten months of age, they gradually lose this ability through a process of perceptual narrowing. Although traditionally investigated with isolated speech sounds, such narrowing occurs in a variety of perceptual domains (e.g., faces, visual speech). Thus far, tracking the developmental trajectory of this tuning process has been focused primarily on auditory speech alone, and generally using isolated sounds. But infants learn from speech produced by people talking to them, meaning they learn from a complex audiovisual signal. Here, we use near-infrared spectroscopy to measure blood concentration changes in the bilateral temporal cortices of infants in three different age groups: 3-to-6 months, 7-to-10 months, and 11-to-14-months. Critically, all three groups of infants were tested with continuous audiovisual speech in both their native and another, unfamiliar language. We found that at each age range, infants showed different patterns of cortical activity in response to the native and non-native stimuli. Infants in the youngest group showed bilateral cortical activity that was greater overall in response to non-native relative to native speech; the oldest group showed left lateralized activity in response to native relative to non-native speech. These results highlight perceptual tuning as a dynamic process that happens across modalities and at different levels of stimulus complexity. PMID:25116572
The use of visual cues during the processing of audiovisual speech is known to be less efficient in children and adults with language difficulties, and such difficulties are known to be more prevalent in children from low-income populations. In the present study, we followed an economically diverse group of thirty-seven infants longitudinally from 6-9 months to 14-16 months of age. We used eye-tracking to examine whether individual differences in visual attention during audiovisual processing of speech in 6 to 9 month old infants, particularly when processing congruent and incongruent auditory and visual speech cues, might be indicative of their later language development. Twenty-two of these 6-9 month old infants also participated in an event-related potential (ERP) audiovisual task within the same experimental session. Language development was then followed up at the age of 14-16 months, using two measures of language development, the Preschool Language Scale (PLS) and the Oxford Communicative Development Inventory (CDI). The results show that those infants who were less efficient in auditory speech processing at the age of 6-9 months had lower receptive language scores at 14-16 months. A correlational analysis revealed that the pattern of face scanning and ERP responses to audio-visually incongruent stimuli at 6-9 months were both significantly associated with language development at 14-16 months. These findings add to the understanding of individual differences in neural signatures of audiovisual processing and associated looking behaviour in infants.
McGowan, Kevin B
Listeners' use of social information during speech perception was investigated by measuring transcription accuracy of Chinese-accented speech in noise while listeners were presented with a congruent Chinese face, an incongruent Caucasian face, or an uninformative silhouette. When listeners were presented with a Chinese face they transcribed more accurately than when presented with the Caucasian face. This difference existed both for listeners with a relatively high level of experience and for listeners with a relatively low level of experience with Chinese-accented English. Overall, these results are inconsistent with a model of social speech perception in which listener bias reduces attendance to the acoustic signal. These results are generally consistent with exemplar models of socially indexed speech perception predicting that activation of a social category will raise base activation levels of socially appropriate episodic traces, but the similar performance of more and less experienced listeners suggests the need for a more nuanced view with a role for both detailed experience and listener stereotypes. PMID:27483742
The emergence of heterogeneous networks and the rapid increase of Voice over IP (VoIP) applications provide important opportunities for the telecommunications market. These opportunities come at the price of increased complexity in the monitoring of the quality of service (QoS) and the need for adaptation of transmission systems to the changing environmental conditions. This thesis contains three papers concerned with quality assessment and enhancement of speech communication systems in adver...
Mitterer, Holger; Scharenborg, Odette; McQueen, James M
Recent evidence shows that listeners use abstract prelexical units in speech perception. Using the phenomenon of lexical retuning in speech processing, we ask whether those units are necessarily phonemic. Dutch listeners were exposed to a Dutch speaker producing ambiguous phones between the Dutch syllable-final allophones approximant [r] and dark [l]. These ambiguous phones replaced either final /r/ or final /l/ in words in a lexical-decision task. This differential exposure affected perception of ambiguous stimuli on the same allophone continuum in a subsequent phonetic-categorization test: Listeners exposed to ambiguous phones in /r/-final words were more likely to perceive test stimuli as /r/ than listeners with exposure in /l/-final words. This effect was not found for test stimuli on continua using other allophones of /r/ and /l/. These results confirm that listeners use phonological abstraction in speech perception. They also show that context-sensitive allophones can play a role in this process, and hence that context-insensitive phonemes are not necessary. We suggest there may be no one unit of perception. PMID:23973464
Campos-Sánchez, Antonio; López-Núñez, Juan-Antonio; Scionti, Giuseppe; Garzón, Ingrid; González-Andrades, Miguel; Alaminos, Miguel; Sola, Tomás
Videos can be used as didactic tools for self-learning under several circumstances, including those cases in which students are responsible for the development of this resource as an audiovisual notebook. We compared students' and teachers' perceptions regarding the main features that an audiovisual notebook should include. Four questionnaires with items about information, images, text and music, and filmmaking were used to investigate students' (n = 115) and teachers' (n = 28) perceptions re…
Recent magneto-encephalographic and electro-encephalographic studies provide evidence for cross-modal integration during audio-visual and audio-haptic speech perception, with speech gestures viewed or felt from manual tactile contact with the speaker’s face. Given the temporal precedence of the haptic and visual signals over the acoustic signal in these studies, the observed modulation of N1/P2 auditory evoked responses during bimodal compared to unimodal speech perception suggests that relevant and predictive visual and haptic cues may facilitate auditory speech processing. To further investigate this hypothesis, auditory evoked potentials were here compared during auditory-only, audio-visual and audio-haptic speech perception in live dyadic interactions between a listener and a speaker. In line with previous studies, auditory evoked potentials were attenuated and speeded up during both audio-haptic and audio-visual compared to auditory speech perception. Importantly, the observed latency and amplitude reduction did not significantly depend on the degree of visual and haptic recognition of the speech targets. Altogether, these results further demonstrate cross-modal interactions between the auditory, visual and haptic speech signals. Although they do not contradict the hypothesis that visual and haptic sensory inputs convey predictive information with respect to the incoming auditory speech input, these results suggest that, at least in live conversational interactions, systematic conclusions on sensory predictability in bimodal speech integration have to be taken with caution, with the extraction of predictive cues likely depending on the variability of the speech stimuli.
This book presents a new approach to examining the perceived quality of audiovisual sequences. It uses electroencephalography to understand how exactly user quality judgments are formed within a test participant, and what the physiologically-based implications might be of exposure to lower quality media. The book redefines experimental paradigms of using EEG in the area of quality assessment so that they better suit the requirements of standard subjective quality testing. Therefore, experimental protocols and stimuli are adjusted accordingly.
Chen, Yi-Chuan; Shore, David I; Lewis, Terri L; Maurer, Daphne
We measured the typical developmental trajectory of the window of audiovisual simultaneity by testing four age groups of children (5, 7, 9, and 11 years) and adults. We presented a visual flash and an auditory noise burst at various stimulus onset asynchronies (SOAs) and asked participants to report whether the two stimuli were presented at the same time. Compared with adults, children aged 5 and 7 years made more simultaneous responses when the SOAs were beyond ± 200 ms but made fewer simultaneous responses at the 0 ms SOA. The point of subjective simultaneity was located at the visual-leading side, as in adults, by 5 years of age, the youngest age tested. However, the window of audiovisual simultaneity became narrower and response errors decreased with age, reaching adult levels by 9 years of age. Experiment 2 ruled out the possibility that the adult-like performance of 9-year-old children was caused by the testing of a wide range of SOAs. Together, the results demonstrate that the adult-like precision of perceiving audiovisual simultaneity is developed by 9 years of age, the youngest age that has been reported to date. PMID:26897264
Holt, Lori L.; Lotto, Andrew J.
The complexities of the acoustic speech signal pose many significant challenges for listeners. Although perceiving speech begins with auditory processing, investigation of speech perception has progressed mostly independently of study of the auditory system. Nevertheless, a growing body of evidence demonstrates that cross-fertilization between the two areas of research can be productive. We briefly describe research bridging the study of general auditory processing and speech perception, show...
Fernandez Pradier, Melanie
This thesis deals with emotion recognition from speech signals. The feature extraction step shall be improved by looking at the perception of music. In music theory, different pitch intervals (consonant, dissonant) and chords are believed to invoke different feelings in listeners. The question is whether there is a similar mechanism between perception of music and perception of emotional speech. Our research will follow three stages. First, the relationship between speech and music at segment...
Lolli, Sydney L.; Lewenstein, Ari D.; Basurto, Julian; Winnik, Sean; Loui, Psyche
Congenital amusics, or “tone-deaf” individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying low-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions fro...
Kubicek, Claudia; Gervain, Judit; Hillairet de Boisferon, Anne; Pascalis, Olivier; Lœvenbruck, Hélène; Schwarzer, Gudrun
The present study examined whether infant-directed (ID) speech facilitates intersensory matching of audio-visual fluent speech in 12-month-old infants. German-learning infants' audio-visual matching ability of German and French fluent speech was assessed by using a variant of the intermodal matching procedure, with auditory and visual speech information presented sequentially. In Experiment 1, the sentences were spoken in an adult-directed (AD) manner. Results showed that 12-month-old ...
Hulusi Kafaligonul; Can Oluk
Motion perception is a pervasive aspect of vision and is affected both by the immediate pattern of sensory inputs and by prior experiences acquired through associations. Recently, several studies reported that an association can be established quickly between directions of visual motion and static sounds of distinct frequencies. After the association is formed, sounds are able to change the perceived direction of visual motion. To determine whether such rapidly acquired audiovisual associations and their subsequent influences on visual motion perception are dependent on the involvement of higher-order attentive tracking mechanisms, we designed psychophysical experiments using regular and reverse-phi random dot motions isolating low-level pre-attentive motion processing. Our results show that an association between the directions of low-level visual motion and static sounds can be formed and that this audiovisual association alters the subsequent perception of low-level visual motion. These findings support the view that audiovisual associations are not restricted to the high-level attention-based motion system and that early-level visual motion processing has some potential role.
Preston, Jonathan L; Irwin, Julia R; Turcios, Jacqueline
Children with speech sound disorders may perceive speech differently than children with typical speech development. The nature of these speech differences is reviewed with an emphasis on assessing phoneme-specific perception for speech sounds that are produced in error. Category goodness judgment, or the ability to judge accurate and inaccurate tokens of speech sounds, plays an important role in phonological development. The software Speech Assessment and Interactive Learning System, which has been effectively used to assess preschoolers' ability to perform goodness judgments, is explored for school-aged children with residual speech errors (RSEs). However, data suggest that this particular task may not be sensitive to perceptual differences in school-aged children. The need for the development of clinical tools for assessment of speech perception in school-aged children with RSE is highlighted, and clinical suggestions are provided. PMID:26458198
Burfin, Sabine; Pascalis, Olivier; Ruiz Tada, Elisa; Costa, Albert; Savariaux, Christophe; Kandel, Sonia
We all go through a process of perceptual narrowing for phoneme identification. As we become experts in the languages we hear in our environment we lose the ability to identify phonemes that do not exist in our native phonological inventory. This research examined how linguistic experience—i.e., the exposure to a double phonological code during childhood—affects the visual processes involved in non-native phoneme identification in audiovisual speech perception. We conducted a phoneme identifi...
Ito, Takayuki; Gracco, Vincent; Ostry, David J.
Speech perception is known to rely on both auditory and visual information. However, sound-specific somatosensory input has also been shown to influence speech perceptual processing (Ito et al., 2009). In the present study we further examined the relationship between somatosensory information and speech perceptual processing by addressing the hypothesis that the temporal relationship between orofacial movement and sound processing contributes to somatosensory-auditory interaction in speech p...
Viciana-Abad, Raquel; Marfil, Rebeca; Perez-Lorenzo, Jose M; Bandera, Juan P; Romero-Garces, Adrian; Reche-Lopez, Pedro
One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those used in video-conference rooms, and thus, they may incur difficulties when constrained to the sensors with which a robot can be equipped. Besides, within the scope of interactive autonomous robots, there is a lack in terms of evaluating the benefits of audio-visual attention mechanisms, compared to only audio or visual approaches, in real scenarios. Most of the tests conducted have been within controlled environments, at short distances and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information with a Bayes inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. Moreover, the performance of this system is evaluated and compared via considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interactive framework. PMID:24878593
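A compact sketch of the Bayes-inference fusion idea on a discrete grid of candidate directions, with Gaussian audio and visual likelihoods; the geometry, noise parameters, and names are our assumptions, not the authors' implementation.

```python
import numpy as np

# Discrete grid of candidate azimuths; a prior is combined with independent
# audio and visual likelihoods via Bayes' rule to localize the speaker.
angles = np.linspace(-90, 90, 181)          # candidate azimuths in degrees

def gaussian_likelihood(measured: float, sigma: float) -> np.ndarray:
    return np.exp(-0.5 * ((angles - measured) / sigma) ** 2)

prior = np.ones_like(angles) / angles.size  # uninformative initial prior
audio_like = gaussian_likelihood(12.0, sigma=15.0)   # coarse audio estimate
visual_like = gaussian_likelihood(8.0, sigma=5.0)    # sharper visual estimate

posterior = prior * audio_like * visual_like
posterior /= posterior.sum()
print(angles[np.argmax(posterior)])  # fused estimate, pulled toward vision
```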
Martínez-Montes, Eduardo; Hernández-Pérez, Heivet; Chobert, Julie; Morgado-Rodríguez, Lisbet; Suárez-Murias, Carlos; Valdés-Sosa, Pedro A.; Besson, Mireille
The aim of this experiment was to investigate the influence of musical expertise on the automatic perception of foreign syllables and harmonic sounds. Participants were Cuban students with high level of expertise in music or in visual arts and with the same level of general education and socio-economic background. We used a multi-feature Mismatch Negativity (MMN) design with sequences of either syllables in Mandarin Chinese or harmonic sounds, both comprising deviants in pitch contour, duration and Voice Onset Time (VOT) or equivalent that were either far from (Large deviants) or close to (Small deviants) the standard. For both Mandarin syllables and harmonic sounds, results were clear-cut in showing larger MMNs to pitch contour deviants in musicians than in visual artists. Results were less clear for duration and VOT deviants, possibly because of the specific characteristics of the stimuli. Results are interpreted as reflecting similar processing of pitch contour in speech and non-speech sounds. The implications of these results for understanding the influence of intense musical training from childhood to adulthood and of genetic predispositions for music on foreign language perception are discussed. PMID:24294193
Bertoncini, J; Cabrera, L
The development of speech perception relies upon early auditory capacities (i.e. discrimination, segmentation and representation). Infants are able to discriminate most of the phonetic contrasts occurring in natural languages, and at the end of the first year, this universal ability starts to narrow down to the contrasts used in the environmental language. During the second year, this specialization is characterized by the development of comprehension, lexical organization and word production. That process now appears to be the result of multiple interactions between developing perceptual, cognitive and social abilities. Distinct factors, such as word acquisition, sensitivity to the statistical properties of the input, or the nature of social interactions, might play a role at one time or another during the acquisition of phonological patterns. Experience with the native language is necessary for phonetic segments to become functional units of perception and for speech sound representations (words, syllables) to become more specified and phonetically organized. This evolution continues beyond 24 months of age in a learning context characterized from the early stages by interaction with other developing (linguistic and non-linguistic) capacities. PMID:25218761
Foundations of Voice and Speech Quality Perception starts out with the fundamental question: "How do listeners perceive voice and speech quality, and how can these processes be modeled?" Any quantitative answer requires measurements. This is natural for physical quantities but harder to imagine for perceptual measurands. The book approaches the problem by identifying the major perceptual dimensions of voice and speech quality perception, defining units wherever possible, and offering paradigms that position these dimensions within a structural skeleton of perceptual speech and voice quality. The emphasis is placed on the quality assessment of voice and speech systems in artificial scenarios. Many scientific fields are involved. This book bridges the gap between two quite diverse fields, engineering and humanities, and establishes the new research area of Voice and Speech Quality Perception.
Shahin, Antoine J
Does musical training affect our perception of speech? For example, does learning to play a musical instrument modify the neural circuitry for auditory processing in a way that improves one's ability to perceive speech more clearly in noisy environments? If so, can speech perception in individuals with hearing loss (HL), who struggle in noisy situations, benefit from musical training? While music and speech exhibit some specialization in neural processing, there is evidence suggesting that skills acquired through musical training for specific acoustical processes may transfer to, and thereby improve, speech perception. The neurophysiological mechanisms underlying the influence of musical training on speech processing, and the extent of this influence, remain a rich area to be explored. A prerequisite for such transfer is the facilitation of greater neurophysiological overlap between speech and music processing following musical training. This review first establishes a neurophysiological link between musical training and speech perception, and subsequently provides further hypotheses on the neurophysiological implications of musical training on speech perception in adverse acoustical environments and in individuals with HL. PMID:21716639
Venezia, Jonathan H; Fillmore, Paul; Matchin, William; Isenberg, A Lisette; Hickok, Gregory; Fridriksson, Julius
Sensory information is critical for movement control, both for defining the targets of actions and providing feedback during planning or ongoing movements. This holds for speech motor control as well, where both auditory and somatosensory information have been shown to play a key role. Recent clinical research demonstrates that individuals with severe speech production deficits can show a dramatic improvement in fluency during online mimicking of an audiovisual speech signal, suggesting the existence of a visuomotor pathway for speech motor control. Here we used fMRI in healthy individuals to identify this new visuomotor circuit for speech production. Participants were asked to perceive and covertly rehearse nonsense syllable sequences presented auditorily, visually, or audiovisually. The motor act of rehearsal, which is prima facie the same whether or not it is cued with a visible talker, produced different patterns of sensorimotor activation when cued by visual or audiovisual speech (relative to auditory speech). In particular, a network of brain regions including the left posterior middle temporal gyrus and several frontoparietal sensorimotor areas activated more strongly during rehearsal cued by a visible talker versus rehearsal cued by auditory speech alone. Some of these brain regions responded exclusively to rehearsal cued by visual or audiovisual speech. This result has significant implications for models of speech motor control, for the treatment of speech output disorders, and for models of the role of speech gesture imitation in development. PMID:26608242
Lu, Chunming; Long, Yuhang; Zheng, Lifen; Shi, Guang; Liu, Li; Ding, Guosheng; Howell, Peter
Speech production difficulties are apparent in people who stutter (PWS). PWS also have difficulties in speech perception compared to controls. It is unclear whether the speech perception difficulties in PWS are independent of, or related to, their speech production difficulties. To investigate this issue, functional MRI data were collected on 13 PWS and 13 controls whilst the participants performed a speech production task and a speech perception task. PWS performed more poorly than controls in the perception task, and the poorer performance was associated with a functional activity difference in the left anterior insula (part of the speech motor area) compared to controls. PWS also showed a functional activity difference in this and the surrounding area [left inferior frontal cortex (IFC)/anterior insula] in the production task compared to controls. Conjunction analysis showed that the functional activity differences between PWS and controls in the left IFC/anterior insula coincided across the perception and production tasks. Furthermore, Granger Causality Analysis on the resting-state fMRI data of the participants showed that the causal connection from the left IFC/anterior insula to an area in the left primary auditory cortex (Heschl's gyrus) differed significantly between PWS and controls. The strength of this connection correlated significantly with performance in the perception task. These results suggest that speech perception difficulties in PWS are associated with anomalous functional activity in the speech motor area, and that the altered functional connectivity from this area to the auditory area plays a role in their speech perception difficulties.
Cabbage, Kathryn L; Farquharson, Kelly; Hogan, Tiffany P
Some children with residual deficits in speech production also display characteristics of dyslexia; however, the causes of these disorders--in isolation or comorbidly--remain unknown. Presently, the role of phonological representations is an important construct for considering how the underlying system of phonology functions. In particular, two related skills--speech perception and phonological working memory--may provide insight into the nature of phonological representations. This study provides an exploratory investigation into the profiles of three 9-year-old children: one with residual speech errors, one with residual speech errors and dyslexia, and one who demonstrated typical, age-appropriate speech sound production and reading skills. We provide an in-depth examination of their relative abilities in the areas of speech perception, phonological working memory, vocabulary, and word reading. Based on these preliminary explorations, we suggest implications for the assessment and treatment of children with residual speech errors and/or dyslexia. PMID:26458199
Kaganovich, Natalya; Schumaker, Jennifer
Sensitivity to the temporal relationship between auditory and visual stimuli is key to efficient audiovisual integration. However, even adults vary greatly in their ability to detect audiovisual temporal asynchrony. What underlies this variability is currently unknown. We recorded event-related potentials (ERPs) while participants performed a simultaneity judgment task on a range of audiovisual (AV) and visual-auditory (VA) stimulus onset asynchronies (SOAs) and compared ERP responses in good and poor performers to the 200 ms SOA, which showed the largest individual variability in the number of synchronous perceptions. Analysis of ERPs to the VA200 stimulus yielded no significant results. However, those individuals who were more sensitive to the AV200 SOA had significantly more positive voltage between 210 and 270 ms following the sound onset. In a follow-up analysis, we showed that the mean voltage within this window predicted approximately 36% of variability in sensitivity to AV temporal asynchrony in a larger group of participants. The relationship between the ERP measure in the 210-270 ms window and accuracy on the simultaneity judgment task also held for two other AV SOAs with significant individual variability, 100 and 300 ms. Because the identified window was time-locked to the onset of sound in the AV stimulus, we conclude that sensitivity to AV temporal asynchrony is shaped to a large extent by the efficiency of the neural encoding of sound onsets. PMID:27094850
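The follow-up analysis above reduces to a simple across-subject regression: average each participant's ERP voltage in the 210-270 ms post-sound window, then correlate it with behavioral sensitivity. A minimal sketch of that logic follows; the sampling rate, array shapes, and variable names are assumptions, and random arrays stand in for real data.

```python
# Sketch: mean ERP amplitude in a fixed window vs. behavioral sensitivity.
import numpy as np
from scipy.stats import pearsonr

fs = 500                                  # sampling rate (Hz), assumed
erp = np.random.randn(30, 400)            # subjects x samples; column 0 = sound onset
accuracy = np.random.rand(30)             # simultaneity-judgment sensitivity (assumed)

win = slice(int(0.210 * fs), int(0.270 * fs))
mean_voltage = erp[:, win].mean(axis=1)   # per-subject mean in the 210-270 ms window

r, p = pearsonr(mean_voltage, accuracy)   # r**2 ~ variance explained (~36% reported)
print(f"r = {r:.2f}, p = {p:.3f}")
```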
Kushnerenko, Elena; Tomalski, Przemyslaw; Ballieux, Haiko; Potton, Anita; Birtles, Deidre; Frostick, Caroline; Moore, Derek G.
The use of visual cues during the processing of audiovisual (AV) speech is known to be less efficient in children and adults with language difficulties and difficulties are known to be more prevalent in children from low-income populations. In the present study, we followed an economically diverse group of thirty-seven infants longitudinally from 6–9 months to 14–16 months of age. We used eye-tracking to examine whether individual differences in visual attention during AV processing of speech in 6–9 month old infants, particularly when processing congruent and incongruent auditory and visual speech cues, might be indicative of their later language development. Twenty-two of these 6–9 month old infants also participated in an event-related potential (ERP) AV task within the same experimental session. Language development was then followed-up at the age of 14–16 months, using two measures of language development, the Preschool Language Scale and the Oxford Communicative Development Inventory. The results show that those infants who were less efficient in auditory speech processing at the age of 6–9 months had lower receptive language scores at 14–16 months. A correlational analysis revealed that the pattern of face scanning and ERP responses to audiovisually incongruent stimuli at 6–9 months were both significantly associated with language development at 14–16 months. These findings add to the understanding of individual differences in neural signatures of AV processing and associated looking behavior in infants. PMID:23882240
Lolli, Sydney L.; Lewenstein, Ari D.; Basurto, Julian; Winnik, Sean; Loui, Psyche
Congenital amusics, or “tone-deaf” individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying low-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions from the Macquarie Battery for Evaluation of Prosody were reliably identifiable by US English speakers. The most reliably identified emotional speech samples were used in Experiment 1, in which subjects performed a psychophysical pitch discrimination task, and an emotion identification task under low-pass and unfiltered speech conditions. Results showed a significant correlation between pitch-discrimination threshold and emotion identification accuracy for low-pass filtered speech, with amusics (defined here as those with a pitch discrimination threshold >16 Hz) performing worse than controls. This relationship with pitch discrimination was not seen in unfiltered speech conditions. Given the dissociation between low-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in this manipulation. To assess this potential compensation, Experiment 2 was conducted using high-pass filtered speech samples intended to isolate non-pitch cues. No significant correlation was found between pitch discrimination and emotion identification accuracy for high-pass filtered speech. Results from these experiments suggest an influence of low frequency information in identifying emotional content of speech. PMID:26441718
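The filtering manipulation used in these experiments can be sketched in a few lines: a low-pass filter retains the fundamental-frequency (pitch) region while removing most segmental detail, and a high-pass filter isolates the non-pitch cues probed in Experiment 2. A hedged sketch assuming scipy; the 500 Hz cutoff and other values are illustrative, not the study's parameters.

```python
# Sketch: low-pass vs. high-pass filtering of an emotional speech sample.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def filter_speech(x, fs, cutoff=500.0, kind="lowpass"):
    # 4th-order Butterworth, zero-phase filtering; cutoff is an assumption
    sos = butter(4, cutoff, btype=kind, fs=fs, output="sos")
    return sosfiltfilt(sos, x)

fs = 16000
x = np.random.randn(fs)                       # stand-in for a speech sample
low = filter_speech(x, fs, kind="lowpass")    # pitch cues largely retained
high = filter_speech(x, fs, kind="highpass")  # pitch cues largely removed
```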
Audio-visual speech recognition (AVSR) using acoustic and visual speech signals has received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision-fusion-based AVSR systems is how to obtain an appropriate integration weight for the speech modalities, so that the combined AVSR system performs better than the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA) based optimization scheme that obtains the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA-optimized, reliability-ratio-based weight estimation scheme is demonstrated via single-speaker, mobile-function isolated-word recognition experiments. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability-ratio-based AVSR system under various signal-to-noise-ratio conditions.
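The decision-fusion idea described above can be summarized as a weighted combination of per-class log-likelihoods from the audio and visual recognizers. The sketch below is a simplified illustration, not the paper's implementation: the dispersion-based reliability proxy and all values are assumptions, and a genetic algorithm would search around such an initial weight per noise condition.

```python
# Sketch: reliability-ratio decision fusion for AVSR.
import numpy as np

def fuse_scores(log_like_audio, log_like_visual, lam):
    """Weighted log-likelihood combination; lam in [0, 1]."""
    return lam * log_like_audio + (1.0 - lam) * log_like_visual

# Log-likelihoods for 10 candidate words from each modality (assumed values)
la = np.log(np.random.dirichlet(np.ones(10)))
lv = np.log(np.random.dirichlet(np.ones(10)))

# Dispersion of each modality's log-likelihoods as a crude reliability proxy,
# giving an initial weight that a GA could then refine per noise condition.
rel_a, rel_v = la.var(), lv.var()
lam0 = rel_a / (rel_a + rel_v)

word = np.argmax(fuse_scores(la, lv, lam0))
print(f"lambda0 = {lam0:.2f}, recognized word index = {word}")
```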
Aubanel, Vincent; Davis, Chris; Kim, Jeesun
A growing body of evidence shows that brain oscillations track speech. This mechanism is thought to maximize processing efficiency by allocating resources to important speech information, effectively parsing speech into units of appropriate granularity for further decoding. However, some aspects of this mechanism remain unclear. First, while periodicity is an intrinsic property of this physiological mechanism, speech is only quasi-periodic, so it is not clear whether periodicity would present an advantage in processing. Second, it is still a matter of debate which aspect of speech triggers or maintains cortical entrainment, from bottom-up cues such as fluctuations of the amplitude envelope of speech to higher level linguistic cues such as syntactic structure. We present data from a behavioral experiment assessing the effect of isochronous retiming of speech on speech perception in noise. Two types of anchor points were defined for retiming speech, namely syllable onsets and amplitude envelope peaks. For each anchor point type, retiming was implemented at two hierarchical levels, a slow time scale around 2.5 Hz and a fast time scale around 4 Hz. Results show that while any temporal distortion resulted in reduced speech intelligibility, isochronous speech anchored to P-centers (approximated by stressed syllable vowel onsets) was significantly more intelligible than a matched anisochronous retiming, suggesting a facilitative role of periodicity defined on linguistically motivated units in processing speech in noise.
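Isochronous retiming as described above amounts to moving anchor points (syllable onsets or P-centers) onto an evenly spaced grid and warping the signal between them. The following sketch shows only the anchor mapping, using plain linear interpolation; real retiming would use pitch-synchronous overlap-add (e.g., PSOLA) to avoid local pitch shifts, and all names and times here are assumptions.

```python
# Sketch: piecewise-linear time warp that makes anchor points isochronous.
import numpy as np

def isochronous_warp(x, fs, anchors_s):
    """Resample x so that the given anchor times become evenly spaced."""
    anchors = np.asarray(anchors_s) * fs
    grid = np.linspace(anchors[0], anchors[-1], len(anchors))  # isochronous targets
    n = len(x)
    src = np.concatenate(([0], anchors, [n - 1]))
    dst = np.concatenate(([0], grid, [n - 1]))
    # For each output sample, find its source position via the piecewise map,
    # then read the signal there by linear interpolation (crude, pitch-shifting).
    out_idx = np.interp(np.arange(n), dst, src)
    return np.interp(out_idx, np.arange(n), x)

fs = 16000
x = np.random.randn(2 * fs)                 # stand-in for a 2 s utterance
onsets = [0.20, 0.55, 0.80, 1.30, 1.70]     # assumed syllable-onset times (s)
y = isochronous_warp(x, fs, onsets)
```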
BU Fanliang; CHEN Yanpu
As the human ear is dull to phase in speech, little attention has been paid to phase information in speech coding. In fact, perceptual speech quality may be degraded if the phase distortion is very large. The perceptual effect of the STFT (short-time Fourier transform) phase spectrum was studied by subjective listening tests. The three main conclusions are: (1) if the phase information is neglected completely, the subjective quality of the reconstructed speech may be very poor; (2) whether the neglected phase is in the low or the high frequency band, the difference from the original speech can be perceived by ear; and (3) it is very difficult for the human ear to perceive a difference in quality between the original and the reconstructed speech when the phase quantization step size is smaller than π/7.
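The quantization experiment behind conclusion (3) is easy to reproduce in outline: compute the STFT, quantize the phase spectrum with a uniform step, and resynthesize. A minimal sketch assuming scipy; the frame length is an arbitrary choice and a noise signal stands in for speech.

```python
# Sketch: uniform STFT phase quantization with step pi/7, then resynthesis.
import numpy as np
from scipy.signal import stft, istft

def quantize_phase(x, fs, step=np.pi / 7):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    q_phase = np.round(phase / step) * step      # uniform phase quantization
    _, y = istft(mag * np.exp(1j * q_phase), fs=fs, nperseg=512)
    return y

fs = 16000
x = np.random.randn(fs)          # stand-in for a 1 s speech signal
y = quantize_phase(x, fs)        # per the results, hard to distinguish by ear
```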
José Antonio Palao Errando
In the so-called «information society», film studies have been diluted into pragmatic and technological approaches to audiovisual discourse, just as the enjoyment of cinema itself has been caught in the web of DVD and hypertext. Cinema reacts to this through complex narrative structures that distance it from standard audiovisual discourse. The function of film studies in university education should be to reintroduce the subject rejected by informational knowledge through the interpretation of the film text.
This book interconnects two essential disciplines to study the perception of speech: Neuroscience and Quality of Experience, which to date have rarely been used together for the purposes of research on speech quality perception. In five key experiments, the book demonstrates the application of standard clinical methods in neurophysiology on the one hand, and of methods used in fields of research concerned with speech quality perception on the other. Using this combination, the book shows that speech stimuli with different lengths and different quality impairments are accompanied by physiological reactions related to quality variations, e.g., a positive peak in an event-related potential. Furthermore, it demonstrates that – in most cases – quality impairment intensity has an impact on the intensity of physiological reactions.
Andersen, Tobias S.
Information processing in the sensory modalities is not segregated but interacts strongly. The exact nature of this interaction is not known and might differ for different multisensory phenomena. Here, we investigate two cases of categorical audiovisual perception: speech perception and the perception of rapid flashes and beeps. It is known that multisensory interactions in general depend on physical factors, such as information reliability and modality appropriateness, but it is not known...
Pisoni, David B.
Over the past few years, there has been increased interest in studying some of the cognitive factors that affect speech perception performance of cochlear implant patients. In this paper, I provide a brief theoretical overview of the fundamental assumptions of the information-processing approach to cognition and discuss the role of perception, learning, and memory in speech perception and spoken language processing. The information-processing framework provides researchers and clinicians with...
Wagner, Petra; Windmann, Andreas
Time shrinking denotes the psycho-acoustic phenomenon that an acoustic event is perceived as shorter if it follows an even shorter acoustic event. Previous work has shown that time shrinking can be traced in speech-like phrases and may lead to the impression of a higher speech rate and syllable isochrony. This paper provides experimental evidence that time shrinking is effective on foot level as well as phrase level. Some examples from natural speech are given, where time shrinking effe...
This fMRI study examines shared and distinct cortical areas involved in the auditory perception of song and speech at the level of their underlying constituents: words, pitch and rhythm. Univariate and multivariate analyses were performed on the brain activity patterns of six conditions, arranged in a subtractive hierarchy: sung sentences including words, pitch and rhythm; hummed speech prosody and song melody containing only pitch patterns and rhythm; as well as the pure musical or speech rhythm. Systematic contrasts between these balanced conditions following their hierarchical organization showed a great overlap between song and speech at all levels in the bilateral temporal lobe, but suggested a differential role of the inferior frontal gyrus (IFG) and intraparietal sulcus (IPS) in processing song and speech. The left IFG was involved in word- and pitch-related processing in speech, the right IFG in processing pitch in song. Furthermore, the IPS showed sensitivity to discrete pitch relations in song as opposed to the gliding pitch in speech. Finally, the superior temporal gyrus and premotor cortex coded for general differences between words and pitch patterns, irrespective of whether they were sung or spoken. Thus, song and speech share many features which are reflected in a fundamental similarity of brain areas involved in their perception. However, fine-grained acoustic differences on word and pitch level are reflected in the activity of IFG and IPS.
Stacey, Paula C; Kitterick, Pádraig T; Morris, Saffron D; Sumner, Christian J
Understanding what is said in demanding listening situations is assisted greatly by looking at the face of a talker. Previous studies have observed that normal-hearing listeners can benefit from this visual information when a talker's voice is presented in background noise. These benefits have also been observed in quiet listening conditions in cochlear-implant users, whose device does not convey the informative temporal fine structure cues in speech, and when normal-hearing individuals listen to speech processed to remove these informative temporal fine structure cues. The current study (1) characterised the benefits of visual information when listening in background noise; and (2) used sine-wave vocoding to compare the size of the visual benefit when speech is presented with or without informative temporal fine structure. The accuracy with which normal-hearing individuals reported words in spoken sentences was assessed across three experiments. The availability of visual information and informative temporal fine structure cues was varied within and across the experiments. The results showed that visual benefit was observed using open- and closed-set tests of speech perception. The size of the benefit increased when informative temporal fine structure cues were removed. This finding suggests that visual information may play an important role in the ability of cochlear-implant users to understand speech in many everyday situations. Models of audio-visual integration were able to account for the additional benefit of visual information when speech was degraded and suggested that auditory and visual information was being integrated in a similar way in all conditions. The modelling results were consistent with the notion that audio-visual benefit is derived from the optimal combination of auditory and visual sensory cues. PMID:27085797
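Sine-wave (tone) vocoding of the kind used in this study replaces each band of the speech signal with a sinusoidal carrier modulated by that band's temporal envelope, discarding temporal fine structure. The sketch below is a hedged illustration assuming scipy; the band count, filter order, and band edges are assumptions rather than the study's parameters.

```python
# Sketch: sine-wave vocoder that keeps band envelopes, discards fine structure.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def sine_vocode(x, fs, edges_hz):
    t = np.arange(len(x)) / fs
    out = np.zeros_like(x)
    for lo, hi in zip(edges_hz[:-1], edges_hz[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, x)))               # band envelope
        out += env * np.sin(2 * np.pi * np.sqrt(lo * hi) * t)    # carrier at band centre
    return out

fs = 16000
x = np.random.randn(fs)                  # stand-in for a spoken sentence
edges = np.geomspace(100, 7000, 9)       # 8 log-spaced analysis bands (assumed)
y = sine_vocode(x, fs, edges)
```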
Coady, Jeffry A; Kluender, Keith R; Evans, Julia L
Previous research has suggested that children with specific language impairments (SLI) have deficits in basic speech perception abilities, and this may be an underlying source of their linguistic deficits. These findings have come from studies in which perception of synthetic versions of meaningless syllables was typically examined in tasks with high memory demands. In this study, 20 children with SLI (mean age = 9 years, 3 months) and 20 age-matched peers participated in a categorical perception task. Children identified and discriminated digitally edited versions of naturally spoken real words in tasks designed to minimize memory requirements. Both groups exhibited all hallmarks of categorical perception: a sharp labeling function, discontinuous discrimination performance, and discrimination predicted from identification. There were no group differences for identification data, but children with SLI showed lower peak discrimination values. Children with SLI still discriminated phonemically contrastive pairs at levels significantly better than chance, with discrimination of same-label pairs at chance. These data suggest that children with SLI perceive natural speech tokens comparably to age-matched controls when listening to words under conditions that minimize memory load. Further, poor performance on speech perception tasks may not be due to a speech perception deficit, but rather to a consequence of task demands. PMID:16378484
Biau, Emmanuel; Soto-Faraco, Salvador
Spontaneous beat gestures are an integral part of the paralinguistic context during face-to-face conversations. Here we investigated the time course of beat-speech integration in speech perception by measuring ERPs evoked by words pronounced with or without an accompanying beat gesture, while participants watched a spoken discourse. Words accompanied by beats elicited a positive shift in ERPs at an early sensory stage (before 100 ms) and at a later time window coinciding with the auditory com...
Vlčková-Mejvaldová, J.; Horák, Petr
Berlin: Springer, 2011 (Travieso-González, C.; Alonso-Hernández, J., eds.), pp. 170-176. ISBN 978-3-642-25019-4. ISSN 0302-9743. [5th International Conference on Nonlinear Speech Processing (NOLISP 2011), Las Palmas de Gran Canaria (ES), 07.11.2011-09.11.2011] Keywords: emotions; perception tests; speech synthesis.
Anderson, Melinda C; Arehart, Kathryn H; Kates, James M
Speech perception depends on access to spectral and temporal acoustic cues. Temporal cues include slowly varying amplitude changes (i.e. temporal envelope, TE) and quickly varying amplitude changes associated with the center frequency of the auditory filter (i.e. temporal fine structure, TFS). This study quantifies the effects of TFS randomization through noise vocoding on the perception of speech quality by parametrically varying the amount of original TFS available above 1500 Hz. The two research aims were: 1) to establish the role of TFS in quality perception, and 2) to determine if the role of TFS in quality perception differs between subjects with normal hearing and subjects with sensorineural hearing loss. Ratings were obtained from 20 subjects (10 with normal hearing and 10 with hearing loss) using an 11-point quality scale. Stimuli were processed in three different ways: 1) a 32-channel noise-excited vocoder with random envelope fluctuations in the noise carrier, 2) a 32-channel noise-excited vocoder with the noise-carrier envelope smoothed, and 3) removal of high-frequency bands. Stimuli were presented in quiet and in babble noise at 18 dB and 12 dB signal-to-noise ratios. TFS randomization had a measurable detrimental effect on quality ratings for speech in quiet and a smaller effect for speech in background babble. Subjects with normal hearing and subjects with sensorineural hearing loss provided similar quality ratings for noise-vocoded speech. PMID:24333929
Lametti, Daniel R.; Rochet-Capellan, Amélie; Neufeld, Emily; Shiller, Douglas M.; Ostry, David J.
Recent studies of human speech motor learning suggest that learning is accompanied by changes in auditory perception. But what drives the perceptual change? Is it a consequence of changes in the motor system? Or is it a result of sensory inflow during learning? Here, subjects participated in a speech motor-learning task involving adaptation to altered auditory feedback and they were subsequently tested for perceptual change. In two separate experiments, involving two different auditory percep...
Messaoud-Galusi, Souhila; Hazan, Valerie; Rosen, Stuart
Purpose: The claim that speech perception abilities are impaired in dyslexia was investigated in a group of 62 children with dyslexia and 51 average readers matched in age. Method: To test whether there was robust evidence of speech perception deficits in children with dyslexia, speech perception in noise and quiet was measured using 8 different…
Environmental statistics are known to be important factors shaping our perceptual system. The visual and auditory systems have evolved to be efficient for processing natural images or speech. The common characteristic of natural images and speech is that they are both highly structured and therefore highly redundant. Our perceptual system may use redundancy reduction and sparse coding strategies to deal with the complex stimuli encountered every day. Both redundancy reduction ...
Slavova, Velina; Verhelst, Werner; Sahli, Hichem
In this report we summarize the state of the art of speech emotion recognition from the signal processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, the observation is made that existing approaches for supervised machine learning lead to database-dependent classifiers which cannot be applied to multi-language speech emotion recognition without additional training, because they discriminate the emotion classes following the use...
Jin, Yu; Díaz, Begoña; Colomer, Marc; Sebastián Gallés, Núria
Individual differences in second language (L2) phoneme perception (within the normal population) have been related to speech perception abilities, also observed in the native language, in studies assessing the electrophysiological response mismatch negativity (MMN). Here, we investigate the brain oscillatory dynamics in the theta band, the spectral correlate of the MMN, that underpin success in phoneme learning. Using previous data obtained in an MMN paradigm, the dynamics of cort...
Jiang, Jintao; Bernstein, Lynne E.
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses: auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, the presented acoustic signal versus the acoustic signal recorded with the presented video, and the correlation between the presented acoustic and video stimuli. In Experiment 1, open-set perceptual responses were obtained for acoustic /bA/ or /lA/ dubbed to video /bA, dA, gA, vA, zA, lA, wA, ðA/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions ranging between 17% and 52%. That audiovisual stimulus relationships can account for response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships. PMID:21574741
Nicholls, Michael E. R.; Searle, Dara A.
This study explored asymmetries for movement, expression and perception of visual speech. Sixteen dextral models were videoed as they articulated: "bat," "cat," "fat," and "sat." Measurements revealed that the right side of the mouth was opened wider and for a longer period than the left. The asymmetry was accentuated at the beginning and ends of…
Badino, Leonardo; D'Ausilio, Alessandro; Fadiga, Luciano; Metta, Giorgio
Action perception and recognition are core abilities fundamental for human social interaction. A parieto-frontal network (the mirror neuron system) matches visually presented biological motion information onto observers' motor representations. This process of matching the actions of others onto our own sensorimotor repertoire is thought to be important for action recognition, providing a non-mediated "motor perception" based on a bidirectional flow of information along the mirror parieto-frontal circuits. State-of-the-art machine learning strategies for hand action identification have shown better performances when sensorimotor data, as opposed to visual information only, are available during learning. As speech is a particular type of action (with acoustic targets), it is expected to activate a mirror neuron mechanism. Indeed, in speech perception, motor centers have been shown to be causally involved in the discrimination of speech sounds. In this paper, we review recent neurophysiological and machine learning-based studies showing (a) the specific contribution of the motor system to speech perception and (b) that automatic phone recognition is significantly improved when motor data are used during training of classifiers (as opposed to learning from purely auditory data). PMID:24935820
Boatman, Dana F.
Recent brain mapping studies have provided new insights into the cortical systems that mediate human speech perception. Electrocortical stimulation mapping (ESM) is a brain mapping method that is used clinically to localize cortical functions in neurosurgical patients. Recent ESM studies have yielded new insights into the cortical systems that…
Zhang, Juan; McBride-Chang, Catherine
While the importance of phonological sensitivity for understanding reading acquisition and impairment across orthographies is well documented, what underlies deficits in phonological sensitivity is not well understood. Some researchers have argued that speech perception underlies variability in phonological representations. Others have…
Rance, Gary; Fava, Rosanne; Baldock, Heath; Chong, April; Barker, Elizabeth; Corben, Louise; Delatycki
The aim of this study was to investigate auditory pathway function and speech perception ability in individuals with Friedreich ataxia (FRDA). Ten subjects confirmed by genetic testing as being homozygous for a GAA expansion in intron 1 of the FXN gene were included. While each of the subjects demonstrated normal, or near normal sound detection, 3…
Accounts of speech perception disagree on whether listeners perceive the acoustic signal (Diehl, Lotto, & Holt, 2004) or the vocal tract gestures that produce the signal (e.g., Fowler, 1986). In this dissertation, I outline a research program using a phenomenon called "perceptual compensation for coarticulation" (Mann, 1980) to examine this…
Fowler, Carol A; Shankweiler, Donald; Studdert-Kennedy, Michael
We revisit an article, "Perception of the Speech Code" (PSC), published in this journal 50 years ago (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967) and address one of its legacies concerning the status of phonetic segments, which persists in theories of speech today. In the perspective of PSC, segments both exist (in language as known) and do not exist (in articulation or the acoustic speech signal). Findings interpreted as showing that speech is not a sound alphabet, but, rather, phonemes are encoded in the signal, coupled with findings that listeners perceive articulation, led to the motor theory of speech perception, a highly controversial legacy of PSC. However, a second legacy, the paradoxical perspective on segments has been mostly unquestioned. We remove the paradox by offering an alternative supported by converging evidence that segments exist in language both as known and as used. We support the existence of segments in both language knowledge and in production by showing that phonetic segments are articulatory and dynamic and that coarticulation does not eliminate them. We show that segments leave an acoustic signature that listeners can track. This suggests that speech is well-adapted to public communication in facilitating, not creating a barrier to, exchange of language forms. PMID:26301536
ZHOU Li; NIE Yong-Wei
The paper tries to analyze speech perception in terms of its structure, process, levels and models. Some problems concerning speech perception are touched upon. The paper aims to provide some reference for oral English teaching and learning in the light of speech perception, and is intended to arouse readers' reflection on the effect of speech perception on oral English teaching.
Anderson, Samira; Chandrasekaran, Bharath; Yi, Han-Gyol; Kraus, Nina
Children are known to be particularly vulnerable to the effects of noise on speech perception, and it is commonly acknowledged that failure of central auditory processes can lead to these difficulties with speech-in-noise (SIN) perception. Still, little is known about the mechanistic relationship between central processes and the perception of speech in noise. Our aims were two-fold: to examine the effects of noise on the central encoding of speech through measurement of cortical event-relate...
Allard, Emily R.; Williams, Dale F.
Using semantic differential scales with nine trait pairs, 445 adults rated five audio-taped speech samples, one depicting an individual without a disorder and four portraying communication disorders. Statistical analyses indicated that the no disorder sample was rated higher with respect to the trait of employability than were the articulation,…
Wong, Patrick C. M.; Uppunda, Ajith K.; Parrish, Todd B.; Dhar, Sumitrajit
Purpose: The present study examines the brain basis of listening to spoken words in noise, which is a ubiquitous characteristic of communication, with the focus on the dorsal auditory pathway. Method: English-speaking young adults identified single words in 3 listening conditions while their hemodynamic response was measured using fMRI: speech in…
Genovese, E; Orzan, E; Turrini, M; Babighian, G; Arslan, E
Speech perception tests are an important part of procedures for diagnosing pre-verbal hearing loss. Merely establishing a child's hearing threshold with and without a hearing aid is not sufficient to ensure an adequate evaluation with a view to selecting cases suitable for cochlear implants, because it fails to reliably indicate the real benefit obtained from using a conventional hearing aid. Speech perception tests have proved useful not only for patient selection, but also for subsequent evaluation of the efficacy of new hearing aids, such as tactile devices and cochlear implants. In clinical practice, the tests most commonly adopted with small children are the Auditory Comprehension Test (ACT), Discrimination After Training (DAT), the Monosyllable, Trochee, Spondee test (MTS), the Glendonald Auditory Screening Procedure (GASP), and the Early Speech Perception test (ESP). Rather than considering specific results achieved in individual cases, reference is generally made to the four speech perception classes proposed by Moog and Geers of the CID of St. Louis. The purpose of this classification, based on results obtained with tests suitably differentiated according to the child's age and language ability, is to detect differences in perception of a spoken message in ideal listening conditions. To date, no Italian-language speech perception test has been designed to assess the speech perception level of children with profound hearing impairment. We attempted, therefore, to adapt the existing English tests to the Italian language, taking into consideration the differences between the two languages. Our attention focused on the ESP test since it can be applied to even very small children (2 years old). The ESP is proposed in a standard version for hearing-impaired children over the age of 6 years and in a simplified version for younger children. The rationale we used for selecting Italian words reflects the rationale established for the original version, but the
Audiovisual translation is now a well-established sub-discipline of Translation Studies (TS): a position that it has reached over the last twenty years or so. Italian scholars and professionals in the field have made a substantial contribution to this successful development, a brief overview of which will be given in the first part of this article, inevitably concentrating on dubbing in the Italian context. Special attention will be devoted to the question of target audience perception, an area where researchers at the University of Bologna at Forlì have excelled. The second part of the article applies the methodology followed by the above-mentioned researchers in a case study of how Italian end users perceive the dubbed version of the British film The History Boys (2006), which contains a plethora of culture-specific verbal and visual references to the English education system. The aim of the study was to ascertain: (a) whether translation/adaptation allows the transmission in this admittedly constrained medium of all the intended culture-bound issues, only too well known to the source audience, and, if so, to what extent; and (b) whether the target audience respondents to the e-questionnaire used were aware that they were missing information. The linked, albeit controversial, issue of quality assessment will also be addressed.
Bazon, Aline Cristine; Mantello, Erika Barioni; Gonçales, Alina Sanches; Isaac, Myriam de Lima; Hyppolito, Miguel Angelo; Reis, Ana Cláudia Mirândola Barbosa
Introduction: The objective of the evaluation of auditory perception of cochlear implant users is to determine how the acoustic signal is processed, leading to the recognition and understanding of sound. Objective: To investigate the differences in the process of auditory speech perception in individuals with postlingual hearing loss wearing a cochlear implant, using two different speech coding strategies, and to analyze speech perception and handicap perception in relation to the strategy us...
Stephens, Joseph D.; Holt, Lori L.
Data from Japanese quail suggest that the effect of preceding liquids (/l/ or /r/) on response to subsequent stops (/g/ or /d/) arises from general auditory processes sensitive to the spectral structure of sound [A. J. Lotto, K. R. Kluender, and L. L. Holt, J. Acoust. Soc. Am. 102, 1134-1140 (1997)]. If spectral content is key, appropriate nonspeech sounds should influence perception of speech sounds and vice versa. The former effect has been demonstrated [A. J. Lotto and K. R. Kluender, Percept. Psychophys. 60, 602-619 (1998)]. The current experiment investigated the influence of speech on the perception of nonspeech sounds. Nonspeech stimuli were 80-ms chirps modeled after the F2 and F3 transitions in /ga/ and /da/. F3 onset was increased in equal steps from 1800 Hz (/ga/ analog) to 2700 Hz (/da/ analog) to create a ten-member series. During AX discrimination trials, listeners heard chirps that were three steps apart on the series. Each chirp was preceded by a synthesized /al/ or /ar/. Results showed context effects predicted from differences in spectral content between the syllables and chirps. These results are consistent with the hypothesis that spectral contrast influences context effects in speech perception. [Work supported by ONR, NOHR, and CNBC.]
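The nonspeech series described above can be approximated by synthesizing short frequency glides whose onset frequency steps from 1800 Hz to 2700 Hz in ten equal increments. A hedged sketch; the common offset frequency, amplitude handling, and sampling rate are assumptions, not the study's synthesis parameters.

```python
# Sketch: ten-member series of 80 ms F3-analog frequency glides.
import numpy as np

fs = 22050
dur = 0.080
t = np.linspace(0, dur, int(fs * dur), endpoint=False)

def glide(f_on, f_off):
    """Linear frequency glide synthesized from its instantaneous phase."""
    f_inst = np.linspace(f_on, f_off, t.size)
    return np.sin(2 * np.pi * np.cumsum(f_inst) / fs)

f3_onsets = np.linspace(1800, 2700, 10)               # equal steps, /ga/- to /da/-like
series = [glide(f_on, 2500.0) for f_on in f3_onsets]  # assumed common offset frequency
```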
Schellenberg, E Glenn
Claims of beneficial side effects of music training are made for many different abilities, including verbal and visuospatial abilities, executive functions, working memory, IQ, and speech perception in particular. Such claims assume that music training causes the associations even though children who take music lessons are likely to differ from other children in music aptitude, which is associated with many aspects of speech perception. Music training in childhood is also associated with cognitive, personality, and demographic variables, and it is well established that IQ and personality are determined largely by genetics. Recent evidence also indicates that the role of genetics in music aptitude and music achievement is much larger than previously thought. In short, music training is an ideal model for the study of gene-environment interactions but far less appropriate as a model for the study of plasticity. Children seek out environments, including those with music lessons, that are consistent with their predispositions; such environments exaggerate preexisting individual differences. PMID:25773632
Nisreen Naji Al-Khawaldeh; Vladimir Žegarac
This paper presents the findings of an empirical study which compares Jordanian and English native speakers' perceptions of the speech act of thanking. The forty interviews conducted revealed some similarities but also remarkable cross-cultural differences relating to the significance of thanking, the variables affecting it, and the appropriate linguistic and paralinguistic choices, as well as their impact on the interpretation of thanking behaviour. The most important theoretical finding is that the data, while consistent with many views found in the existing literature, do not support Brown and Levinson's (1987) claim that thanking is a speech act which intrinsically threatens the speaker's negative face because it involves overt acceptance of an imposition on the speaker. Rather, thanking should be viewed as a means of establishing and sustaining social relationships. The study findings suggest that cultural variation in thanking is due to the high degree of sensitivity of this speech act to the complex interplay of a range of social and contextual variables, and point to some promising directions for further research. Keywords: Linguistic Variation, Cross-Cultural Pragmatics, Speech Act of Thanking, Perceptions of Politeness
Patricia Kuhl; Samu Taulu; Alexis Bosseler; Elina Pihko; Jyrki Mäkelä
The development of speech perception shows a dramatic transition between infancy and adulthood. Between 6 and 12 months, infants' initial ability to discriminate all phonetic units across the world's languages narrows—native discrimination increases while non-native discrimination shows a steep decline. We used magnetoencephalography (MEG) to examine whether brain oscillations in the theta band (4–8 Hz), reflecting increases in attention and cognitive effort, would provide a neural measure of...
Gick, Bryan; Derrick, Donald
Visual information from a speaker's face can enhance [1] or interfere with [2] accurate auditory perception. This integration of information across auditory and visual streams has been observed in functional imaging studies [3,4], and has typically been attributed to the frequency and robustness with which perceivers jointly encounter event-specific information from these two modalities [5]. Adding the tactile modality has long been considered a crucial next step in understanding multisensory integration...
Heinrich, Antje; Henshaw, Helen; Ferguson, Melanie A
Listeners vary in their ability to understand speech in noisy environments. Hearing sensitivity, as measured by pure-tone audiometry, can only partly explain these results, and cognition has emerged as another key concept. Although cognition relates to speech perception, the exact nature of the relationship remains to be fully understood. This study investigates how different aspects of cognition, particularly working memory and attention, relate to speech intelligibility for various tests. Perceptual accuracy of speech perception represents just one aspect of functioning in a listening environment. Activity and participation limits imposed by hearing loss, in addition to the demands of a listening environment, are also important and may be better captured by self-report questionnaires. Understanding how speech perception relates to self-reported aspects of listening forms the second focus of the study. Forty-four listeners aged between 50 and 74 years with mild sensorineural hearing loss were tested on speech perception tests differing in complexity from low (phoneme discrimination in quiet), to medium (digit triplet perception in speech-shaped noise), to high (sentence perception in modulated noise); cognitive tests of attention, memory, and non-verbal intelligence quotient; and self-report questionnaires of general health-related and hearing-specific quality of life. Hearing sensitivity and cognition related to intelligibility differently depending on the speech test: neither was important for phoneme discrimination, hearing sensitivity alone was important for digit triplet perception, and hearing and cognition together played a role in sentence perception. Self-reported aspects of auditory functioning were correlated with speech intelligibility to different degrees, with digit triplets in noise showing the richest pattern. The results suggest that intelligibility tests can vary in their auditory and cognitive demands and their sensitivity to the challenges that auditory environments pose on listeners.
In many natural audiovisual events (e.g., a clap of the two hands), the visual signal precedes the sound and thus allows observers to predict when, where, and which sound will occur. Previous studies have already reported that there are distinct neural correlates of temporal (when) versus phonetic/semantic (which) content on audiovisual integration. Here we examined the effect of visual prediction of auditory location (where) in audiovisual biological motion stimuli by varying the spatial congruency between the auditory and visual part of the audiovisual stimulus. Visual stimuli were presented centrally, whereas auditory stimuli were presented either centrally or at 90° azimuth. Typical subadditive amplitude reductions (AV – V < A) were found for the auditory N1 and P2 for spatially congruent and incongruent conditions. The new finding is that the N1 suppression was larger for spatially congruent stimuli. A very early audiovisual interaction was also found at 30-50 ms in the spatially congruent condition, while no effect of congruency was found on the suppression of the P2. This indicates that visual prediction of auditory location can be coded very early in auditory processing.
van de Rijt, Luuk P. H.; van Opstal, A. John; Mylanus, Emmanuel A. M.; Straatman, Louise V.; Hu, Hai Yin; Snik, Ad F. M.; van Wanrooij, Marc M.
Background: Speech understanding may rely not only on auditory, but also on visual information. Non-invasive functional neuroimaging techniques can expose the neural processes underlying the integration of multisensory processes required for speech understanding in humans. Nevertheless, noise (from functional MRI, fMRI) limits its usefulness in auditory experiments, and electromagnetic artifacts caused by electronic implants worn by subjects can severely distort the scans (EEG, fMRI). Therefore, we assessed audio-visual activation of temporal cortex with a silent, optical neuroimaging technique: functional near-infrared spectroscopy (fNIRS). Methods: We studied temporal cortical activation as represented by concentration changes of oxy- and deoxy-hemoglobin in four easy-to-apply fNIRS optical channels of 33 normal-hearing adult subjects and five post-lingually deaf cochlear implant (CI) users in response to supra-threshold unisensory auditory and visual, as well as to congruent auditory-visual speech stimuli. Results: Activation effects were not visible from single fNIRS channels. However, by discounting physiological noise through reference channel subtraction (RCS), auditory, visual and audiovisual (AV) speech stimuli evoked concentration changes for all sensory modalities in both cohorts (p < 0.001). Auditory stimulation evoked larger concentration changes than visual stimuli (p < 0.001). A saturation effect was observed for the AV condition. Conclusions: Physiological, systemic noise can be removed from fNIRS signals by RCS. The observed multisensory enhancement of an auditory cortical channel can be plausibly described by a simple addition of the auditory and visual signals with saturation. PMID:26903848
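The reference channel subtraction described above amounts to removing the systemic (scalp and blood-pressure related) component captured by a reference channel from a signal channel. As a rough illustration only, the regression-based sketch below shows one common way such a subtraction can be implemented; the sampling rate, signal shapes, and variable names are assumptions for the demonstration, not details taken from the study.

```python
import numpy as np

def reference_channel_subtraction(signal_ch, reference_ch):
    """Regress a reference fNIRS channel out of a signal channel.

    signal_ch, reference_ch: 1-D arrays of hemoglobin concentration changes
    sampled at the same rate. Returns the residual (noise-reduced) signal.
    """
    # Least-squares scaling of the reference channel onto the signal channel
    beta = np.dot(reference_ch, signal_ch) / np.dot(reference_ch, reference_ch)
    return signal_ch - beta * reference_ch

# Synthetic demonstration (all values are illustrative assumptions):
rng = np.random.default_rng(0)
t = np.arange(0.0, 60.0, 0.1)                   # 10 Hz sampling, 60 s
systemic = np.sin(2 * np.pi * 0.1 * t)          # slow systemic oscillation
response = np.exp(-((t - 30.0) ** 2) / 20.0)    # idealized hemodynamic response
signal = response + systemic + 0.05 * rng.standard_normal(t.size)
reference = systemic + 0.05 * rng.standard_normal(t.size)
cleaned = reference_channel_subtraction(signal, reference)
```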
Cristia, Alejandrina; Seidl, Amanda; Junge, Caroline; Soderstrom, Melanie; Hagoort, Peter
There are increasing reports that individual variation in behavioral and neurophysiological measures of infant speech processing predicts later language outcomes, and specifically concurrent or subsequent vocabulary size. If such findings hold up under scrutiny, they could both illuminate theoretical models of language development and contribute to the prediction of communicative disorders. A qualitative, systematic review of this emergent literature illustrated the variety of approaches that have been used and highlighted some conceptual problems regarding the measurements. A quantitative analysis of the same data established that the bivariate relation was significant, with correlations of similar strength to those found for well-established nonlinguistic predictors of language. Further exploration of infant speech perception predictors, particularly from a methodological perspective, is recommended. PMID:24320112
Bidelman, Gavin M
Neural oscillations have been linked to various perceptual and cognitive brain operations. Here, we examined the role of these induced brain responses in categorical speech perception (CP), a phenomenon in which similar features are mapped to discrete, common identities despite their equidistant/continuous physical spacing. We recorded neuroelectric activity while participants rapidly classified sounds along a vowel continuum (/u/ to /a/). Time-frequency analyses applied to the EEG revealed distinct temporal dynamics in induced (non-phase locked) oscillations; increased β (15-30Hz) coded prototypical vowel sounds carrying well-defined phonetic categories whereas increased γ (50-70Hz) accompanied ambiguous tokens near the categorical boundary. Notably, changes in β activity were strongly correlated with the slope of listeners' psychometric identification functions, a measure of the "steepness" of their categorical percept. Our findings demonstrate that in addition to previously observed evoked (phase-locked) correlates of CP, induced brain activity in the β-band codes the ambiguity and strength of categorical speech percepts. PMID:25540857
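Separating induced from evoked activity, as in the study above, is commonly done by subtracting the trial-averaged evoked response from each single trial before estimating spectral power. The sketch below illustrates that generic approach for one frequency band; the filter design and sampling rate are assumptions chosen for illustration, not the authors' exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def induced_band_power(trials, fs, band):
    """Estimate induced (non-phase-locked) power in a frequency band.

    trials: (n_trials, n_samples) single-trial EEG epochs
    fs: sampling rate in Hz; band: (low, high) edge frequencies in Hz
    Subtracting the trial-averaged evoked response first leaves only
    activity that is not phase-locked to stimulus onset.
    """
    residual = trials - trials.mean(axis=0, keepdims=True)    # remove ERP
    b, a = butter(4, np.asarray(band) / (fs / 2.0), btype="bandpass")
    filtered = filtfilt(b, a, residual, axis=1)
    envelope = np.abs(hilbert(filtered, axis=1))              # analytic amplitude
    return (envelope ** 2).mean(axis=0)                       # mean power over trials

# e.g., beta-band induced power (sampling rate assumed for illustration):
# beta_power = induced_band_power(epochs, fs=500, band=(15, 30))
```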
Heinrich, Antje; Henshaw, Helen; Ferguson, Melanie A.
Good speech perception and communication skills in everyday life are crucial for participation and well-being, and are therefore an overarching aim of auditory rehabilitation. Both behavioral and self-report measures can be used to assess these skills. However, correlations between behavioral and self-report speech perception measures are often low. One possible explanation is that there is a mismatch between the specific situations used in the assessment of these skills in each method, and a more careful matching across situations might improve consistency of results. The role that cognition plays in specific speech situations may also be important for understanding communication, as speech perception tests vary in their cognitive demands. In this study, the role of executive function, working memory (WM) and attention in behavioral and self-report measures of speech perception was investigated. Thirty existing hearing aid users with mild-to-moderate hearing loss aged between 50 and 74 years completed a behavioral test battery with speech perception tests ranging from phoneme discrimination in modulated noise (easy) to words in multi-talker babble (medium) and keyword perception in a carrier sentence against a distractor voice (difficult). In addition, a self-report measure of aided communication, residual disability from the Glasgow Hearing Aid Benefit Profile, was obtained. Correlations between speech perception tests and self-report measures were higher when specific speech situations across both were matched. Cognition correlated with behavioral speech perception test results but not with self-report. Only the most difficult speech perception test, keyword perception in a carrier sentence with a competing distractor voice, engaged executive functions in addition to WM. In conclusion, any relationship between behavioral and self-report speech perception is not mediated by a shared correlation with cognition. PMID:27242564
In speech perception, a functional hierarchy has been proposed by recent functional neuroimaging studies: core auditory areas on the dorsal plane of the superior temporal gyrus (STG) are sensitive to basic acoustic characteristics, whereas downstream regions, specifically the left superior temporal sulcus (STS) and middle temporal gyrus (MTG) ventral to Heschl's gyrus (HG), are responsive to abstract phonological features. What is unclear so far is the relationship between the dorsal and ventral processes, especially with regard to whether low-level acoustic processing is modulated by high-level phonological processing. To address the issue, we assessed the sensitivity of core auditory and downstream regions to acoustic and phonological variations by using within- and across-category lexical tonal continua with equal physical intervals. We found that relative to within-category variation, across-category variation elicited stronger activation in the left middle MTG (mMTG), apparently reflecting the abstract phonological representations. At the same time, activation in the core auditory region decreased, resulting from the top-down influences of phonological processing. These results support a hierarchical organization of the ventral acoustic-phonological processing stream, which originates in the right HG/STG and projects to the left mMTG. Furthermore, our study provides direct evidence that low-level acoustic analysis is modulated by high-level phonological representations, revealing the cortical dynamics of acoustic and phonological processing in speech perception. Our findings confirm the existence of reciprocal projections in the auditory pathways and the roles of both feed-forward and feedback mechanisms in speech perception.
Patro, Chhayakanta; Mendel, Lisa Lucks
Understanding speech within an auditory scene is constantly challenged by interfering noise in suboptimal listening environments when noise hinders the continuity of the speech stream. In such instances, a typical auditory-cognitive system perceptually integrates available speech information and "fills in" missing information in the light of semantic context. However, individuals with cochlear implants (CIs) find it difficult and effortful to understand interrupted speech compared to their normal hearing counterparts. This inefficiency in perceptual integration of speech could be attributed to further degradations in the spectral-temporal domain imposed by CIs, making it difficult to utilize the contextual evidence effectively. To address these issues, 20 normal hearing adults listened to speech that was either spectrally reduced or both spectrally reduced and interrupted, in a manner similar to CI processing. The Revised Speech Perception in Noise test, which includes contextually rich and contextually poor sentences, was used to evaluate the influence of semantic context on speech perception. Results indicated that listeners benefited more from semantic context when they listened to spectrally reduced speech alone. For the spectrally reduced interrupted speech, contextual information was not as helpful under significant spectral reductions, but became beneficial as the spectral resolution improved. These results suggest top-down processing facilitates speech perception up to a point, but fails to facilitate speech understanding when the speech signals are significantly degraded. PMID:27586760
Mealings, Kiri T.; Demuth, Katherine; Buchholz, Jörg; Dillon, Harvey
Purpose: Open-plan classroom styles are increasingly being adopted in Australia despite evidence that their high intrusive noise levels adversely affect learning. The aim of this study was to develop a new Australian speech perception task (the Mealings, Demuth, Dillon, and Buchholz Classroom Speech Perception Test) and use it in an open-plan…
Ziegler, Johannes C.; Pech-Georgel, Catherine; George, Florence; Lorenzi, Christian
Speech perception of four phonetic categories (voicing, place, manner, and nasality) was investigated in children with specific language impairment (SLI) (n=20) and age-matched controls (n=19) in quiet and various noise conditions using an AXB two-alternative forced-choice paradigm. Children with SLI exhibited robust speech perception deficits in…
Mirman, Daniel; McClelland, James L; Holt, Lori L.
We describe an account of lexically guided tuning of speech perception based on interactive processing and Hebbian learning. Interactive feedback provides lexical information to prelexical levels, and Hebbian learning uses that information to retune the mapping from auditory input to prelexical representations of speech. Simulations of an extension of the TRACE model of speech perception are presented that demonstrate the efficacy of this mechanism. Further simulations show that acoustic simi...
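The mechanism described above, in which lexical feedback teaches prelexical levels via Hebbian learning, can be illustrated schematically. The sketch below is a deliberately minimal stand-in, not the TRACE extension itself: the weight matrix shape, learning rate, and normalization step are all assumptions made for the demonstration.

```python
import numpy as np

def hebbian_retune(W, acoustic_input, lexical_feedback, eta=0.05):
    """One lexically guided retuning step (a minimal sketch, not TRACE itself).

    W: (n_phonemes, n_acoustic) acoustic-to-prelexical weight matrix.
    lexical_feedback: top-down phoneme activation imposed by a recognized
    word, e.g. np.array([1.0, 0.0]) when the word identifies the sound as /s/.
    """
    prelexical = W @ acoustic_input + lexical_feedback   # feedback biases units
    W = W + eta * np.outer(prelexical, acoustic_input)   # Hebb: post x pre
    return W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize for stability

# Repeated pairing of an ambiguous input with consistent lexical feedback
# shifts the mapping so that input later activates the fed-back phoneme more.
```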
Nabelek, Anna K.; Tampas, Joanna W.; Burchfield, Samuel B.
Background noise is a significant factor influencing hearing-aid satisfaction and is a major reason for rejection of hearing aids. Attempts have been made by previous researchers to relate the use of hearing aids to speech perception in noise (SPIN), with an expectation of improved speech perception followed by an…
Naturally produced English clear speech has been shown to be more intelligible than English conversational speech. However, little is known about the extent of the clear speech effects in the production of nonnative English, and perception of foreign-accented English by younger and older listeners. The present study examined whether Cantonese speakers would employ the same strategies as those used by native English speakers in producing clear speech in their second language. Also, the clear s...
Oded Ghitza; Anne-Lise Giraud; David Poeppel
A recent opinion article ("Neural oscillations in speech: do not be enslaved by the envelope," Obleser et al., 2012) questions the validity of a class of speech perception models inspired by the possible role of neuronal oscillations in decoding speech (e.g., Ghitza, 2011; Giraud and Poeppel, 2012). The authors criticize, in particular, what they see as an over-emphasis of the role of temporal speech envelope information, and an over-emphasis of entrainment to the input rhythm while neglecting ...
Hu, Yi; Tahmina, Qudsia; Runge, Christina; Friedland, David R.
This study assesses the effects of adding low- or high-frequency information to the band-limited telephone-processed speech on bimodal listeners’ telephone speech perception in quiet environments. In the proposed experiments, bimodal users were presented under quiet listening conditions with wideband speech (WB), bandpass-filtered telephone speech (300–3,400 Hz, BP), high-pass filtered speech (f > 300 Hz, HP, i.e., distorted frequency components above 3,400 Hz in telephone speech were restore...
Listeners must accomplish two complementary perceptual feats in extracting a message from speech. They must discriminate linguistically-relevant acoustic variability and generalize across irrelevant variability. Said another way, they must categorize speech. Since the mapping of acoustic variability is language-specific, these categories must be learned from experience. Thus, understanding how, in general, the auditory system acquires and represents categories can inform us about the toolbox of mechanisms available to speech perception. This perspective invites consideration of findings from cognitive neuroscience literatures outside of the speech domain as a means of constraining models of speech perception. Although neurobiological models of speech perception have mainly focused on cerebral cortex, research outside the speech domain is consistent with the possibility of significant subcortical contributions in category learning. Here, we review the functional role of one such structure, the basal ganglia. We examine research from animal electrophysiology, human neuroimaging, and behavior to consider characteristics of basal ganglia processing that may be advantageous for speech category learning. We also present emerging evidence for a direct role for basal ganglia in learning auditory categories in a complex, naturalistic task intended to model the incidental manner in which speech categories are acquired. To conclude, we highlight new research questions that arise in incorporating the broader neuroscience research literature in modeling speech perception, and suggest how understanding contributions of the basal ganglia can inform attempts to optimize training protocols for learning non-native speech categories in adulthood.
Beasley, Augie E.; And Others
Six articles on the use of audiovisual materials in the school library media center cover how to develop an audiovisual production center; audiovisual forms; a checklist for effective video/16mm use in the classroom; slides in learning; hazards of videotaping in the library; and putting audiovisuals on the shelf. (EJS)
Lyxell, B; Rönnberg, J; Andersson, J; Linderoth, E
The study investigated the initial effects of the implementation of vibrotactile support on the individual's speech perception ability. Thirty-two subjects participated in the study; 16 with an acquired deafness and 16 with normal hearing. At a general level, the results indicated no immediate and direct improvement as a function of the implementation across all speech perception tests. However, when the subjects were divided into Skilled and Less Skilled groups, based on their performance in the visual condition of each test, it was found that the performance of the Skilled subjects deteriorated while that of the Less Skilled subjects improved when tactile information was provided in two conditions (word-discrimination and word-decoding conditions). It was concluded that tactile information interferes with Skilled subjects' automaticity of these functions. Furthermore, intercorrelations between discrimination and decoding tasks suggest that there are similarities between visually and tactilely supported speechreading in how they relate to sentence-based speechreading. Clinical implications of the results were discussed. PMID:8210957
Lev-Ari, Shiri; Peperkamp, Sharon
Speech perception is known to be influenced by listeners' expectations of the speaker. This paper tests whether the demographic makeup of individuals' communities can influence their perception of foreign sounds by influencing their expectations of the language. Using online experiments with participants from all across the U.S. and matched census data on the proportion of Spanish and other foreign language speakers in participants' communities, this paper shows that the demographic makeup of individuals' communities influences their expectations of foreign languages to have an alveolar trill versus a tap (Experiment 1), as well as their consequent perception of these sounds (Experiment 2). Thus, the paper shows that while individuals' expectations of foreign languages to have a trill occasionally lead them to misperceive a tap in a foreign language as a trill, a higher proportion of non-trill language speakers in one's community decreases this likelihood. These results show that individuals' environment can influence their perception by shaping their linguistic expectations. PMID:27369129
Human locomotion typically creates noise, a possible consequence of which is the masking of sound signals originating in the surroundings. When walking side by side, people often subconsciously synchronize their steps. The neurophysiological and evolutionary background of this behavior is unclear. The present study investigated the potential of sound created by walking to mask perception of speech and compared the masking produced by walking in step with that produced by unsynchronized walking. The masking sound (footsteps on gravel) and the target sound (speech) were presented through the same speaker to 15 normal-hearing subjects. The original recorded walking sound was modified to mimic the sound of two individuals walking in pace or walking out of synchrony. The participants were instructed to adjust the sound level of the target sound until they could just comprehend the speech signal (the "just follow conversation" or JFC level) when presented simultaneously with synchronized or unsynchronized walking sound at 40 dBA, 50 dBA, 60 dBA, or 70 dBA. Synchronized walking sounds produced slightly less masking of speech than did unsynchronized sound. The median JFC threshold in the synchronized condition was 38.5 dBA, while the corresponding value for the unsynchronized condition was 41.2 dBA. Combined results at all sound pressure levels showed an improvement in the signal-to-noise ratio (SNR) for synchronized footsteps; the median difference was 2.7 dB and the mean difference was 1.2 dB [P < 0.001, repeated-measures analysis of variance (RM-ANOVA)]. The difference was significant for masker levels of 50 dBA and 60 dBA, but not for 40 dBA or 70 dBA. This study provides evidence that synchronized walking may reduce the masking potential of footsteps.
The perception of human languages is inherently a multi-modal process, in which audio information can be compensated by visual information to improve recognition performance. This phenomenon has been studied in English, German, Spanish, and other languages, but it has not yet been reported for Chinese. In our experiment, 14 syllables (/ba, bi, bian, biao, bin, de, di, dian, duo, dong, gai, gan, gen, gu/), extracted from the Chinese audiovisual bimodal speech database CAVSR-1.0, were pronounced by 10 subjects. The audio-only stimuli, audiovisual stimuli, and visual-only stimuli were recognized by 20 observers. The audio-only stimuli and audiovisual stimuli were both presented under 5 conditions: no noise, SNR 0 dB, -8 dB, -12 dB, and -16 dB. The experimental results support the following conclusions for Chinese speech. Human beings can recognize visual-only stimuli rather well. The place of articulation determines the visual distinction. In noisy environments, audio information can be remarkably compensated by visual information, and as a result recognition performance is greatly improved.
Production and comprehension of speech are closely interwoven. For example, the ability to detect an error in one's own speech, halt speech production, and finally correct the error can be explained by assuming an inner speech loop which continuously compares the word representations induced by production to those induced by perception at various cognitive levels (e.g., conceptual, word, or phonological levels). Because spontaneous speech errors are relatively rare, a picture naming and halt paradigm can be used to evoke them. In this paradigm, picture presentation (target word initiation) is followed by an auditory stop signal (distractor word) for halting speech production. The current study seeks to understand the neural mechanisms governing self-detection of speech errors by developing a biologically inspired neural model of the inner speech loop. The neural model is based on the Neural Engineering Framework (NEF) and consists of a network of about 500,000 spiking neurons. In the first experiment we induce simulated speech errors semantically and phonologically. In the second experiment, we simulate a picture naming and halt task. Target-distractor word pairs were balanced with respect to variation of phonological and semantic similarity. The results of the first experiment show that speech errors are successfully detected by a monitoring component in the inner speech loop. The results of the second experiment show that the model correctly reproduces human behavioral data on the picture naming and halt task. In particular, the halting rate in the production of target words was lower for phonologically similar words than for semantically similar or fully dissimilar distractor words. We thus conclude that the neural architecture proposed here to model the inner speech loop reflects important interactions in production and perception at phonological and semantic levels.
Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Networks and coupled HMMs are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN) to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model under a no frame-independency assumption. The experimental results on Tibetan speech data from some real-world environments showed the proposed DDBN outperforms the state-of-the-art methods in word recognition accuracy.
Mair, K. R.
High-functioning adults with autism spectrum disorder (ASD) frequently report difficulties with speech perception in background noise, which cannot be explained simply by an impairment in peripheral hearing or structural language ability. In spite of the apparent prevalence of this problem, however, only a handful of studies so far have evaluated speech reception thresholds (SRTs) in this group under controlled conditions, and then only with a limited range of (mainly non-speech) masking soun...
Mitterer, Holger; Ernestus, Mirjam
This study reports a shadowing experiment, in which one has to repeat a speech stimulus as fast as possible. We tested claims about a direct link between perception and production based on speech gestures, and obtained two types of counterevidence. First, shadowing is not slowed down by a gestural mismatch between stimulus and response. Second,…
Viswanathan, Navin; Magnuson, James S.; Fowler, Carol A.
According to one approach to speech perception, listeners perceive speech by applying general pattern matching mechanisms to the acoustic signal (e.g., Diehl, Lotto, & Holt, 2004). An alternative is that listeners perceive the phonetic gestures that structured the acoustic signal (e.g., Fowler, 1986). The two accounts have offered different…
Lavie, Limor; Banai, Karen; Karni, Avi; Attias, Joseph
Purpose: We tested whether using hearing aids can improve unaided performance in speech perception tasks in older adults with hearing impairment. Method: Unaided performance was evaluated in dichotic listening and speech-in-noise tests in 47 older adults with hearing impairment; 36 participants in 3 study groups were tested before hearing aid…
Casserly, Elizabeth D.
Real-time use of spoken language is a fundamentally interactive process involving speech perception, speech production, linguistic competence, motor control, neurocognitive abilities such as working memory, attention, and executive function, environmental noise, conversational context, and--critically--the communicative interaction between…
Guediche, Sara; Blumstein, Sheila E.; Fiez, Julie A.; Holt, Lori L.
Adult speech perception reflects the long-term regularities of the native language, but it is also flexible such that it accommodates and adapts to adverse listening conditions and short-term deviations from native-language norms. The purpose of this article is to examine how the broader neuroscience literature can inform and advance research efforts in understanding the neural basis of flexibility and adaptive plasticity in speech perception. Specifically, we highlight the potential role of ...
Lori Astheimer; Monika Janus; Sylvain Moreno; Ellen Bialystok
Event-related potential (ERP) evidence demonstrates that preschool-aged children selectively attend to informative moments such as word onsets during speech perception. Although this observation indicates a role for attention in language processing, it is unclear whether this type of attention is part of basic speech perception mechanisms, higher-level language skills, or general cognitive abilities. The current study examined these possibilities by measuring ERPs from 5-year-old children lis...
Eskelund, Kasper; MacDonald, Ewen; Andersen, Tobias
We perceive identity, expression and speech from faces. While perception of identity and expression depends crucially on the configuration of facial features, it is less clear whether this holds for visual speech perception. Facial configuration is poorly perceived for upside-down faces as demonstrated by the Thatcher illusion, in which the orientation of the eyes and mouth with respect to the face is inverted (Thatcherization). This gives the face a grotesque appearance but this is only seen wh...
Barbulescu, Adela; Bailly, Gérard; Ronfard, Rémi; Pouget, Maël
The focus of this study is the generation of expressive audiovisual speech from neutral utterances for 3D virtual actors. Taking into account the segmental and suprasegmental aspects of audiovisual speech, we propose and compare several computational frameworks for the generation of expressive speech and face animation. We notably evaluate a standard frame-based conversion approach with two other methods that postulate the existence of global prosodic audiovisual patterns that are characteris...
Shojaei, Elahe; Ashayeri, Hassan; Jafari, Zahra; Zarrin Dast, Mohammad Reza; Kamali, Koorosh
Background: Speech perception ability depends on auditory and extra-auditory elements. The signal-to-noise ratio (SNR) is an extra-auditory element that has an effect on the ability to normally follow speech and maintain a conversation. Speech in noise perception difficulty is a common complaint of the elderly. In this study, the importance of SNR magnitude as an extra-auditory effect on speech perception in noise was examined in the elderly. Methods: The speech perception in noise test (SPIN) was conducted on 25 elderly participants who had bilateral low–mid frequency normal hearing thresholds at three SNRs in the presence of ipsilateral white noise. These participants were selected through convenience (availability) sampling. Cognitive screening was done using the Persian Mini Mental State Examination (MMSE) test. Results: Independent t-tests, ANOVA, and Pearson correlations were used for statistical analysis. There was a significant difference in word discrimination scores at silence and at three SNRs in both ears (p≤0.047). Moreover, there was a significant difference in word discrimination scores for paired SNRs (0 and +5, 0 and +10, and +5 and +10; p≤0.04). No significant correlation was found between age and word recognition scores at silence and at three SNRs in both ears (p≥0.386). Conclusion: Our results revealed that decreasing the signal level and increasing the competing noise considerably reduced speech perception ability in elderly listeners with normal low–mid frequency hearing thresholds. These results support the critical role of SNRs for speech perception ability in the elderly. Furthermore, our results revealed that normal hearing elderly participants required compensatory strategies to maintain normal speech perception in challenging acoustic situations. PMID:27390712
Anderson, Samira; Kraus, Nina
Numerous factors contribute to understanding speech in noisy listening environments. There is a clinical need for objective biological assessment of auditory factors that contribute to the ability to hear speech in noise, factors that are free from the demands of attention and memory. Subcortical processing of complex sounds such as speech (auditory brainstem responses to speech and other complex stimuli [cABRs]) reflects the integrity of auditory function. Because cABRs physically resemble t...
Crew, Joseph D; Galvin, John J; Landsberger, David M; Fu, Qian-Jie
Cochlear implant (CI) users have difficulty understanding speech in noisy listening conditions and perceiving music. Aided residual acoustic hearing in the contralateral ear can mitigate these limitations. The present study examined contributions of electric and acoustic hearing to speech understanding in noise and melodic pitch perception. Data was collected with the CI only, the hearing aid (HA) only, and both devices together (CI+HA). Speech reception thresholds (SRTs) were adaptively measured for simple sentences in speech babble. Melodic contour identification (MCI) was measured with and without a masker instrument; the fundamental frequency of the masker was varied to be overlapping or non-overlapping with the target contour. Results showed that the CI contributes primarily to bimodal speech perception and that the HA contributes primarily to bimodal melodic pitch perception. In general, CI+HA performance was slightly improved relative to the better ear alone (CI-only) for SRTs but not for MCI, with some subjects experiencing a decrease in bimodal MCI performance relative to the better ear alone (HA-only). Individual performance was highly variable, and the contribution of either device to bimodal perception was both subject- and task-dependent. The results suggest that individualized mapping of CIs and HAs may further improve bimodal speech and music perception. PMID:25790349
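Adaptive SRT measurement of the kind used above typically tracks a target intelligibility level by lowering the SNR after correct responses and raising it after errors. The staircase below is a generic 1-down/1-up textbook variant converging on roughly 50% correct; the starting SNR, step size, and reversal rule are illustrative assumptions, as the abstract does not specify the exact procedure.

```python
import numpy as np

def adaptive_srt(trial_fn, start_snr=10.0, step=2.0, n_reversals=8):
    """Generic 1-down/1-up adaptive staircase for an SRT (in dB SNR).

    trial_fn(snr) -> True if the sentence was repeated correctly.
    Returns the mean SNR over the final reversals as the threshold estimate.
    """
    snr, last_correct, reversals = start_snr, None, []
    while len(reversals) < n_reversals:
        correct = trial_fn(snr)
        if last_correct is not None and correct != last_correct:
            reversals.append(snr)               # direction change = reversal
        last_correct = correct
        snr += -step if correct else step       # harder after hit, easier after miss
        # (a real implementation would typically also shrink the step size early on)
    return float(np.mean(reversals[-6:]))
```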
Judge, S.; Robertson, Z.; Hawley, M.
This study set out to collect data from assistive technology professionals about their provision of speech-driven environmental control systems. This study is part of a larger study looking at developing a new speech-driven environmental control system.
Objective: To determine speech-perception-in-noise (with speech and noise spatially distinct and coincident) and bilateral spatial benefits of the head-shadow effect, summation, squelch and spatial release from masking in adults with delayed sequential cochlear implants. Study design: A cross-sectional one-group post-test-only exploratory design was employed. Eleven adults (mean age 47 years; range 21–69 years) of the Pretoria Cochlear Implant Programme (PCIP) in South Africa with a bilateral severe-to-profound sensorineural hearing loss were recruited. Prerecorded Everyday Speech Sentences of The Central Institute for the Deaf (CID) were used to evaluate participants' speech-in-noise perception at sentence level. An adaptive procedure was used to determine the signal-to-noise ratio (SNR, in dB) at which the participant's speech reception threshold (SRT) was achieved. Specific calculations were used to estimate bilateral spatial benefit effects. Results: A minimal bilateral benefit for speech-in-noise perception was observed with noise directed to the first implant (CI 1) (1.69 dB) and in the speech and noise spatial listening condition (0.78 dB), but was not statistically significant. The head-shadow effect at 180° was the most robust bilateral spatial benefit. An improvement in speech perception in spatially distinct speech and noise indicates the contribution of the second implant (CI 2) is greater than that of the first implant (CI 1) for bilateral spatial benefit. Conclusion: Bilateral benefit for delayed sequentially implanted adults is less than previously reported for simultaneously and sequentially implanted adults. Delayed sequential implantation benefit seems to relate to the availability of the ear with the most favourable SNR.
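The abstract mentions "specific calculations" without giving them; one frequently cited convention (e.g., Schleich et al., 2004) derives each spatial benefit from differences between SRTs measured in different ear and noise-location conditions. The sketch below states those conventional definitions; the condition keys and the assumption that the study followed this scheme are illustrative, not confirmed by the abstract.

```python
def bilateral_benefits(srt):
    """Common SRT-based definitions of bilateral spatial benefits (in dB);
    positive values indicate benefit.

    srt: dict keyed by (ears, noise_side), e.g. ('CI1', 'front'),
         ('both', 'CI1'), with SRT values in dB SNR. Keys are illustrative.
    """
    return {
        # Head shadow: listening with the ear far from the noise vs. near it
        "head_shadow": srt[("CI1", "CI1")] - srt[("CI2", "CI1")],
        # Summation: adding the second ear with speech and noise co-located
        "summation": srt[("CI1", "front")] - srt[("both", "front")],
        # Squelch: adding the ear nearer the noise to the far ear
        "squelch": srt[("CI2", "CI1")] - srt[("both", "CI1")],
        # Spatial release from masking: moving the noise away from the speech
        "srm": srt[("both", "front")] - srt[("both", "CI1")],
    }
```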
Kleinschmidt, Dave F; Jaeger, T Florian
Successful speech perception requires that listeners map the acoustic signal to linguistic categories. These mappings are not only probabilistic, but change depending on the situation. For example, one talker's /p/ might be physically indistinguishable from another talker's /b/ (cf. lack of invariance). We characterize the computational problem posed by such a subjectively nonstationary world and propose that the speech perception system overcomes this challenge by (a) recognizing previously encountered situations, (b) generalizing to other situations based on previous similar experience, and (c) adapting to novel situations. We formalize this proposal in the ideal adapter framework: (a) to (c) can be understood as inference under uncertainty about the appropriate generative model for the current talker, thereby facilitating robust speech perception despite the lack of invariance. We focus on 2 critical aspects of the ideal adapter. First, in situations that clearly deviate from previous experience, listeners need to adapt. We develop a distributional (belief-updating) learning model of incremental adaptation. The model provides a good fit against known and novel phonetic adaptation data, including perceptual recalibration and selective adaptation. Second, robust speech recognition requires that listeners learn to represent the structured component of cross-situation variability in the speech signal. We discuss how these 2 aspects of the ideal adapter provide a unifying explanation for adaptation, talker-specificity, and generalization across talkers and groups of talkers (e.g., accents and dialects). The ideal adapter provides a guiding framework for future investigations into speech perception and adaptation, and more broadly language comprehension. PMID:25844873
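The ideal adapter's incremental belief updating can be illustrated with the simplest possible case: a conjugate normal model of one talker's category mean (for instance, the mean VOT of /b/). The sketch below is a minimal stand-in under that assumption and is not the authors' full model; the prior and token values are invented for the demonstration.

```python
def update_category_belief(mu, kappa, x):
    """One incremental update of the believed category mean for a talker.

    mu: current posterior mean; kappa: prior pseudo-observation count
    (confidence in mu); x: a newly heard token (e.g., a VOT value).
    Conjugate normal update with known category variance: the posterior
    mean is a precision-weighted average of prior belief and new data.
    """
    mu_new = (kappa * mu + x) / (kappa + 1.0)   # mean shifts toward the token
    return mu_new, kappa + 1.0

# A talker whose tokens deviate consistently from the listener's prior
# gradually pulls the believed category (and hence the boundary) toward them:
mu, kappa = 0.0, 10.0                           # prior belief (assumed values)
for x in [3.0, 2.5, 3.2, 2.8]:                  # atypical tokens from a new talker
    mu, kappa = update_category_belief(mu, kappa, x)
```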
Evans, S; McGettigan, C; Agnew, ZK; Rosen, S; Scott, SK
Spoken conversations typically take place in noisy environments and different kinds of masking sounds place differing demands on cognitive resources. Previous studies, examining the modulation of neural activity associated with the properties of competing sounds, have shown that additional speech streams engage the superior temporal gyrus. However, the absence of a condition in which target speech was heard without additional masking made it difficult to identify brain networks specific to masking and to ascertain the extent to which competing speech was processed equivalently to target speech. In this study, we scanned young healthy adults with continuous functional Magnetic Resonance Imaging (fMRI), whilst they listened to stories masked by sounds that differed in their similarity to speech. We show that auditory attention and control networks are activated during attentive listening to masked speech in the absence of an overt behavioural task. We demonstrate that competing speech is processed predominantly in the left hemisphere within the same pathway as target speech but is not treated equivalently within that stream, and that individuals who perform better in speech in noise tasks activate the left mid-posterior superior temporal gyrus more. Finally, we identify neural responses associated with the onset of sounds in the auditory environment; activity was found within right-lateralised frontal regions, consistent with a phasic alerting response. Taken together, these results provide a comprehensive account of the neural processes involved in listening in noise. PMID:26696297
For deaf individuals with residual low-frequency acoustic hearing, combined use of a cochlear implant (CI) and hearing aid (HA) typically provides better speech understanding than with either device alone. Because of coarse spectral resolution, CIs do not provide fundamental frequency (F0) information that contributes to understanding of tonal languages such as Mandarin Chinese. The HA can provide good representation of F0 and, depending on the range of aided acoustic hearing, first and second formant (F1 and F2) information. In this study, Mandarin tone, vowel, and consonant recognition in quiet and noise was measured in 12 adult Mandarin-speaking bimodal listeners with the CI-only and with the CI+HA. Tone recognition was significantly better with the CI+HA in noise, but not in quiet. Vowel recognition was significantly better with the CI+HA in quiet, but not in noise. There was no significant difference in consonant recognition between the CI-only and the CI+HA in quiet or in noise. There was a wide range in bimodal benefit, with improvements often greater than 20 percentage points in some tests and conditions. The bimodal benefit was compared to CI subjects' HA-aided pure-tone average (PTA) thresholds between 250 and 2000 Hz; subjects were divided into two groups according to whether their aided PTA was better or poorer than 50 dB HL. The bimodal benefit differed significantly between groups only for consonant recognition. The bimodal benefit for tone recognition in quiet was significantly correlated with CI experience, suggesting that bimodal CI users learn to better combine low-frequency spectro-temporal information from acoustic hearing with temporal envelope information from electric hearing. Given the small number of subjects in this study (n = 12), further research with Chinese bimodal listeners may provide more information regarding the contribution of acoustic and electric hearing to tonal language perception.
Pyschny, Verena; Landwehr, Markus; Hahn, Moritz; Walger, Martin; von Wedel, Hasso; Meister, Hartmut
Purpose: The objective of the study was to investigate the influence of bimodal stimulation upon hearing ability for speech recognition in the presence of a single competing talker. Method: Speech recognition was measured in 3 listening conditions: hearing aid (HA) alone, cochlear implant (CI) alone, and both devices together (CI + HA). To examine…
The general topic addressed by this dissertation is that of bilingualism, and more specifically, the topic of bilingual acquisition of speech sounds. The central question in this study is the following: does bilingualism affect children’s perceptual development of speech sounds? The term bilingual i
Carlin, Charles H.; Milam, Jennifer L.; Carlin, Emily L.; Owen, Ashley
E-supervision has a potential role in addressing speech-language personnel shortages in rural and difficult to staff school districts. The purposes of this article are twofold: to determine how e-supervision might support graduate speech-language pathologist (SLP) interns placed in rural, remote, and difficult to staff public school districts; and, to investigate interns’ perceptions of in-person supervision compared to e-supervision. The study used a mixed methodology approach and collected ...
Slim Ouni; Michael M. Cohen; Hope Ishak; Dominic W. Massaro
Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954) metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP) to describe the benefit provided b...
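The Sumby and Pollack (1954) measure that the authors extend is commonly written as the audiovisual gain normalized by the headroom above auditory-only performance. The small function below states that baseline metric only; the authors' extension for agent-versus-reference comparison and their FLMP-based metric are not reproduced here.

```python
def visual_enhancement(auditory_correct, audiovisual_correct):
    """Classic Sumby & Pollack (1954) style visual-benefit measure:
    audiovisual gain as a fraction of the available headroom above
    auditory-only performance. Inputs are proportions correct, with
    auditory_correct < 1."""
    return (audiovisual_correct - auditory_correct) / (1.0 - auditory_correct)

# e.g., 0.45 correct auditory-only vs. 0.78 audiovisual:
# visual_enhancement(0.45, 0.78) -> 0.6 (60% of the possible gain realized)
```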
Naiphinich Kotchabhakdi; Chittin Chindaduangratn; Wichian Sittiprapaporn
The present study was intended to make electrophysiological investigations into the preattentive perception of native and non-native speech sounds. We recorded the mismatch negativity, elicited by single syllable change of both native and non-native speech-sound contrasts in tonal languages. EEGs were recorded and low-resolution brain electromagnetic tomography (LORETA) was utilized to explore the neural electrical activity. Our results suggested that the left hemisphere was predominant in the perception of native speech sounds, whereas the non-native speech sound was perceived predominantly by the right hemisphere, which may be explained by the specialization in processing the prosodic and emotional components of speech formed in this hemisphere.
Tobias Weissgerber; Tobias Rader; Uwe Baumann
Objectives: Previous studies investigating speech perception in noise have typically been conducted with static masker positions. The aim of this study was to investigate the effect of spatial separation of source and masker (spatial release from masking, SRM) in a moving masker setup and to evaluate the impact of adaptive beamforming in comparison with fixed directional microphones in cochlear implant (CI) users. Design: Speech reception thresholds (SRT) were measured in S0N0 and in a mov...
Dole, Marjorie; Hoen, Michel; Meunier, Fanny
Developmental dyslexia is associated with impaired speech-in-noise perception. The goal of the present research was to further characterize this deficit in dyslexic adults. In order to specify the mechanisms and processing strategies used by adults with dyslexia during speech-in-noise perception, we explored the influence of background type, presenting single target-words against backgrounds made of cocktail party sounds, modulated speech-derived noise or stationary noise. We also evaluated t...
Portnova, Galina; Martynova, Olga
The perception of complex auditory information such as complete speech sequences develops during human ontogeny. In order to explore age differences in the auditory perception of predictable speech sequences, we compared event-related potentials (ERPs) recorded in 5- to 6-year-old children (N = 15) and adults (N = 15) in response to anticipated speech sequences presented as successive and reversed digit series with randomly omitted digits. The ERPs obtained from the omitted digits significantly differe...
Plant, G L
Four subjects fitted with single-channel vibrotactile aids and provided with training in their use took part in a testing programme aimed at assessing their aided and unaided lipreading performance, their ability to detect segmental and suprasegmental features of speech, and the discrimination of common environmental sounds. The results showed that the vibrotactile aid provided very useful information as to speech and non-speech stimuli with the subjects performing best on those tasks where time/intensity cues provided sufficient information to enable identification. The implications of the study are discussed and a comparison made with those results reported for subjects using cochlear implants. PMID:6897619
Munson, Benjamin; Johnson, Julie M.; Edwards, Jan
Purpose: This study examined whether experienced speech-language pathologists (SLPs) differ from inexperienced people in their perception of phonetic detail in children's speech. Method: Twenty-one experienced SLPs and 21 inexperienced listeners participated in a series of tasks in which they used a visual-analog scale (VAS) to rate children's…
Musicians have a more accurate temporal and tonal representation of auditory stimuli than their non-musician counterparts (Kraus & Chandrasekaran, 2010; Parbery-Clark, Skoe, & Kraus, 2009; Zendel & Alain, 2008; Musacchia, Sams, Skoe, & Kraus, 2007). Musicians who are adept at the production and perception of music are also more sensitive to key acoustic features of speech such as voice onset timing and pitch. Together, these data suggest that musical training may enhance the processing of acoustic information for speech sounds. In the current study, we sought to provide neural evidence that musicians process speech and music in a similar way. We hypothesized that for musicians, right hemisphere areas traditionally associated with music are also engaged for the processing of speech sounds. In contrast we predicted that in non-musicians processing of speech sounds would be localized to traditional left hemisphere language areas. Speech stimuli differing in voice onset time were presented using a dichotic listening paradigm. Subjects either indicated aural location for a specified speech sound or identified a specific speech sound from a directed aural location. Musical training effects and organization of acoustic features were reflected by activity in source generators of the P50. This included greater activation of the right middle temporal gyrus (MTG) and superior temporal gyrus (STG) in musicians. The findings demonstrate recruitment of the right hemisphere in musicians for discriminating speech sounds and a putative broadening of their language network. Musicians appear to have an increased sensitivity to acoustic features and enhanced selective attention to temporal features of speech that is facilitated by musical training and supported, in part, by right hemisphere homologues of established speech processing regions of the brain.
Bhat, Jyoti; Miller, Lee M; Pitt, Mark A; Shahin, Antoine J
Audiovisual (AV) speech perception is robust to temporal asynchronies between visual and auditory stimuli. We investigated the neural mechanisms that facilitate tolerance for audiovisual stimulus onset asynchrony (AVOA) with EEG. Individuals were presented with AV words that were asynchronous in onsets of voice and mouth movement and judged whether they were synchronous or not. Behaviorally, individuals tolerated (perceived as synchronous) longer AVOAs when mouth movement preceded the speech (V-A) stimuli than when the speech preceded mouth movement (A-V). Neurophysiologically, the P1-N1-P2 auditory evoked potentials (AEPs), time-locked to sound onsets and known to arise in and surrounding the primary auditory cortex (PAC), were smaller for the in-sync than the out-of-sync percepts. Spectral power of oscillatory activity in the beta band (14-30 Hz) following the AEPs was larger during the in-sync than out-of-sync perception for both A-V and V-A conditions. However, alpha power (8-14 Hz), also following AEPs, was larger for the in-sync than out-of-sync percepts only in the V-A condition. These results demonstrate that AVOA tolerance is enhanced by inhibiting low-level auditory activity (e.g., AEPs representing generators in and surrounding PAC) that code for acoustic onsets. By reducing sensitivity to acoustic onsets, visual-to-auditory onset mapping is weakened, allowing for greater AVOA tolerance. In contrast, beta and alpha results suggest the involvement of higher-level neural processes that may code for language cues (phonetic, lexical), selective attention, and binding of AV percepts, allowing for wider neural windows of temporal integration, i.e., greater AVOA tolerance. PMID:25505102
Andrew Lee Bowers; David Jenson; Megan Cuellar
Activity in premotor and sensorimotor cortices is found in speech production and some perception tasks. Yet, how sensorimotor integration supports these functions is unclear due to a lack of data examining the timing of activity from these regions. Beta (~20Hz) and alpha (~10Hz) spectral power within the EEG µ rhythm are considered indices of motor and somatosensory activity, respectively. In the current study, perception conditions required discrimination (same/different) of syllables pai...
Berisha, Visar; Liss, Julie; Sandoval, Steven; Utianski, Rene; Spanias, Andreas
The current state of the art in judging pathological speech intelligibility is subjective assessment performed by trained speech-language pathologists (SLPs). These tests, however, are inconsistent, costly, and oftentimes suffer from poor intra- and inter-judge reliability. As such, consistent, reliable, and perceptually-relevant objective evaluations of pathological speech are critical. Here, we propose a data-driven approach to this problem. We propose new cost functions for examining data from a series of experiments, whereby we ask certified SLPs to rate pathological speech along the perceptual dimensions that contribute to decreased intelligibility. We consider qualitative feedback from SLPs in the form of comparisons such as "Is Speaker A's rhythm more similar to Speaker B or Speaker C?" Data of this form are common in behavioral research, but differ from the traditional data structures expected in supervised (data matrix + class labels) or unsupervised (data matrix) machine learning. The proposed method identifies relevant acoustic features that correlate with the ordinal data collected during the experiment. Using these features, we show that we are able to develop objective measures of speech signal degradation that correlate well with SLP responses. PMID:25435817
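One standard way to learn from comparative judgments of this kind ("A is more similar to B than to C") is a hinge loss over distances in a weighted acoustic feature space. The sketch below is a generic formulation under that assumption, not the specific cost functions proposed in the paper; the diagonal weighting and margin are illustrative choices.

```python
import numpy as np

def triplet_hinge_loss(w, A, B, C, margin=1.0):
    """Hinge loss for one SLP judgment "A is more similar to B than to C",
    using a diagonally weighted squared distance over acoustic features.

    w: nonnegative feature weights; A, B, C: acoustic feature vectors.
    """
    d_ab = np.sum(w * (A - B) ** 2)
    d_ac = np.sum(w * (A - C) ** 2)
    return max(0.0, margin + d_ab - d_ac)   # penalized when A is closer to C

# Minimizing the summed loss over all collected triplets (e.g., by projected
# gradient descent, keeping w nonnegative) yields feature weights indicating
# which acoustic dimensions drive the perceptual judgments.
```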
Mayo, L H; Florentine, M; Buus, S
To determine how age of acquisition influences perception of second-language speech, the Speech Perception in Noise (SPIN) test was administered to native Mexican-Spanish-speaking listeners who learned fluent English before age 6 (early bilinguals) or after age 14 (late bilinguals) and monolingual American-English speakers (monolinguals). Results show that the levels of noise at which the speech was intelligible were significantly higher and the benefit from context was significantly greater for monolinguals and early bilinguals than for late bilinguals. These findings indicate that learning a second language at an early age is important for the acquisition of efficient high-level processing of it, at least in the presence of noise. PMID:9210123
Fowler, Jennifer R.; Eggleston, Jessica L.; Reavis, Kelly M.; McMillan, Garnett P.; Reiss, Lina A. J.
Purpose: The objective was to determine whether speech perception could be improved for bimodal listeners (those using a cochlear implant [CI] in one ear and hearing aid in the contralateral ear) by removing low-frequency information provided by the CI, thereby reducing acoustic-electric overlap. Method: Subjects were adult CI subjects with at…
Blood, Gordon W.; Boyle, Michael P.; Blood, Ingrid M.; Nalesnik, Gina R.
Bullying in school-age children is a global epidemic. School personnel play a critical role in eliminating this problem. The goals of this study were to examine speech-language pathologists' (SLPs) perceptions of bullying, endorsement of potential strategies for dealing with bullying, and associations among SLPs' responses and specific demographic…
Ofe, Erin E.; Plumb, Allison M.; Plexico, Laura W.; Haak, Nancy J.
Purpose: The purpose of the current investigation was to examine speech-language pathologists' (SLPs') knowledge and perceptions of bullying, with an emphasis on autism spectrum disorder (ASD). Method: A 46-item, web-based survey was used to address the purposes of this investigation. Participants were recruited through e-mail and electronic…
Feldstein, Stanley; Dohm, Faith-Anne; Crown, Cynthia L.
Presents a study that explores (1) whether listeners regard speakers with similar global speech rates as more competent and attractive and (2) the influence of gender on their perceptions. Explains that the judges consisted of 17 male and 28 female listeners. (CMK)
Lee, Andrew H.; Lyster, Roy
To what extent do second language (L2) learners benefit from instruction that includes corrective feedback (CF) on L2 speech perception? This article addresses this question by reporting the results of a classroom-based experimental study conducted with 32 young adult Korean learners of English. An instruction-only group and an instruction + CF…
experiment, subjects rated the signals in regard to loudness, speech clarity, noisiness and overall acceptance. Based on the results, a criterion for selecting compression parameters that yield some level-variation in the output signal, while still keeping the overall user-acceptance at a tolerable level, is...... they become audible again for the hearing impaired person. The general goal is to place all sounds within the hearing aid users' audible range, such that speech intelligibility and listening comfort become as good as possible. Amplification strategies in hearing aids are in many cases based on...... empirical research, for example investigations of loudness perception in hearing-impaired listeners. Most research has been focused on speech and sounds at medium input levels (e.g., 60-65 dB SPL). It is well documented that for speech at conversational levels, hearing-aid users prefer the signal to be...
Knowland, Victoria C. P.; Evans, Sam; Snell, Caroline; Rosen, Stuart
Purpose: The purpose of the study was to assess the ability of children with developmental language learning impairments (LLIs) to use visual speech cues from the talking face. Method: In this cross-sectional study, 41 typically developing children (mean age: 8 years 0 months, range: 4 years 5 months to 11 years 10 months) and 27 children with…
Approaches to music and audiovisual meaning in film appear to be very different in nature and scope when considered from the point of view of experimental psychology or humanistic studies. Nevertheless, this article argues that experimental studies square with ideas of audiovisual perception and ...
Biau, Emmanuel; Torralba, Mireia; Fuentemilla, Lluis; de Diego Balaguer, Ruth; Soto-Faraco, Salvador
Speakers often accompany speech with spontaneous beat gestures in natural spoken communication. These gestures are usually aligned with lexical stress and can modulate the saliency of their affiliate words. Here we addressed the consequences of beat gestures for the neural correlates of speech perception. Previous studies have highlighted the role played by theta oscillations in the temporal prediction of speech. We hypothesized that the sight of beat gestures may influence ongoing low-frequency neural oscillations around the onset of the corresponding words. Electroencephalographic (EEG) recordings were acquired while participants watched a continuous, naturally recorded discourse. The phase-locking value (PLV) at word onset was calculated from the EEG for pairs of identical words that had been pronounced with and without a concurrent beat gesture in the discourse. We observed an increase in PLV in the 5-6 Hz theta range, as well as a desynchronization in the 8-10 Hz alpha band, around the onset of words preceded by a beat gesture. These findings suggest that beats help tune low-frequency oscillatory activity at relevant moments during natural speech perception, providing new insight into how speech and paralinguistic information are integrated. PMID:25595613
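For context, the PLV used here quantifies inter-trial phase consistency: the magnitude of the mean unit phase vector across trials, ranging from 0 (random phase) to 1 (perfect alignment). A minimal, generic sketch of such a computation, assuming band-pass filtering plus a Hilbert transform rather than the authors' exact pipeline:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def plv_at_onset(epochs, fs, band=(5.0, 6.0)):
    """Phase-locking value across trials, one value per time sample.

    epochs: array of shape (n_trials, n_samples) of EEG segments aligned
    to word onset. The band defaults to the 5-6 Hz theta range discussed
    above. This is a generic PLV computation, not the authors' pipeline.
    """
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, epochs, axis=1)
    phase = np.angle(hilbert(filtered, axis=1))
    # PLV = |mean unit phase vector| across trials: 1.0 means identical
    # phase on every trial, values near 0 mean uniformly scattered phase.
    return np.abs(np.mean(np.exp(1j * phase), axis=0))
```

Comparing the output for with-gesture versus without-gesture word pairs, as the study does, isolates the phase consistency attributable to the beat.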
Young, N M; Grohne, K M; Carrasco, V N; Brown, C
This study compares the auditory perceptual skill development of 23 congenitally deaf children who received the Nucleus 22-channel cochlear implant with the SPEAK speech coding strategy, and 20 children who received the CLARION Multi-Strategy Cochlear Implant with the Continuous Interleaved Sampler (CIS) speech coding strategy. All were under 5 years old at implantation. Preimplantation, there were no significant differences between the groups in age, length of hearing aid use, or communication mode. Auditory skills were assessed at 6 months and 12 months after implantation. Postimplantation, the mean scores on all speech perception tests were higher for the Clarion group. These differences were statistically significant for the pattern perception and monosyllable subtests of the Early Speech Perception battery at 6 months, and for the Glendonald Auditory Screening Procedure at 12 months. Multiple regression analysis revealed that device type accounted for the greatest variance in performance after 12 months of implant use. We conclude that children using the CIS strategy implemented in the Clarion implant may develop better auditory perceptual skills during the first year postimplantation than children using the SPEAK strategy with the Nucleus device. PMID:10214811
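As brief background on the strategy names: CIS ("continuous interleaved sampling") band-pass filters the signal into channels, extracts each channel's amplitude envelope, and uses those envelopes to modulate rapid, non-simultaneous pulse trains across the electrode array. A rough sketch of the envelope-extraction stage (illustrative band edges and filter order, not the Clarion device's clinical parameters) might look like:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def cis_envelopes(audio, fs, edges=(200, 500, 1000, 2000, 4000, 8000)):
    """Per-channel amplitude envelopes, as in a CIS-style front end.

    Assumes fs is comfortably above 16 kHz so the top band is valid.
    Band edges and the 4th-order Butterworth filters are illustrative;
    clinical processors use device-specific filter banks.
    """
    envelopes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfiltfilt(sos, audio)
        envelopes.append(np.abs(hilbert(band)))  # amplitude envelope
    # In CIS these envelopes modulate interleaved (non-overlapping)
    # pulse trains, one electrode per channel.
    return np.stack(envelopes)  # shape: (n_channels, n_samples)
```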
Stevenson, Ryan A; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Camarata, Stephen; Wallace, Mark T
A growing area of interest and relevance in the study of autism spectrum disorder (ASD) focuses on the relationship between multisensory temporal function and the behavioral, perceptual, and cognitive impairments observed in ASD. Atypical sensory processing is becoming increasingly recognized as a core component of autism, with evidence of atypical processing across a number of sensory modalities. These deviations from typical processing underscore the value of interpreting ASD within a multisensory framework. Furthermore, converging evidence illustrates that these differences in audiovisual processing may be specifically related to temporal processing. This review seeks to bridge the connection between temporal processing and audiovisual perception, and to elaborate on emerging data showing differences in audiovisual temporal function in autism. We also discuss the consequences of such changes, the specific impact on the processing of different classes of audiovisual stimuli (e.g., speech vs. nonspeech), and the presumptive brain processes and networks underlying audiovisual temporal integration. Finally, possible downstream behavioral implications and possible remediation strategies are outlined. Autism Res 2016, 9: 720-738. © 2015 International Society for Autism Research, Wiley Periodicals, Inc. PMID:26402725
Cristia, A.; Seidl, A; Junge, C.; Soderstrom, M.; Hagoort, P.
There are increasing reports that individual variation in behavioral and neurophysiological measures of infant speech processing predicts later language outcomes, and specifically concurrent or subsequent vocabulary size. If such findings hold up under scrutiny, they could both illuminate theoretical models of language development and contribute to the prediction of communicative disorders. A qualitative, systematic review of this emergent literature illustrated the variety of approaches ...
Davis, Matthew H.; Coleman, Martin R.; Absalom, Anthony R.; Rodd, Jennifer M.; Johnsrude, Ingrid S.; Matta, Basil F.; Owen, Adrian M.; Menon, David K.
We used functional MRI and the anesthetic agent propofol to assess the relationship among neural responses to speech, successful comprehension, and conscious awareness. Volunteers were scanned while listening to sentences containing ambiguous words, matched sentences without ambiguous words, and signal-correlated noise (SCN). During three scanning sessions, participants were nonsedated (awake), lightly sedated (a slowed response to conversation), and deeply sedated (no conversational response...
Miller, James D; Watson, Charles S; Dubno, Judy R; Leek, Marjorie R
Following an overview of theoretical issues in speech-perception training and of previous efforts to enhance hearing aid use through training, a multisite study designed to evaluate the efficacy of two types of computerized speech-perception training for adults who use hearing aids is described. One training method focuses on the identification of 109 syllable constituents (45 onsets, 28 nuclei, and 36 codas) in quiet and in noise, and on the perception of words in sentences presented at various levels of noise. In the second type of training, participants listen to 6- to 7-minute narratives in noise and are asked several questions about each narrative. Two groups of listeners are trained in a laboratory setting, each using one of these two methods. The training for both groups is preceded and followed by a series of speech-perception tests. Subjects listen in a sound field while wearing their hearing aids at their usual settings. The training continues over 15 to 20 visits, with subjects completing at least 30 hours of focused training with one of the two methods. The two types of training are described in detail, together with a summary of other perceptual and cognitive measures obtained from all participants. PMID:27587914
Jianwu Dang; Masato Akagi; Kiyoshi Honda
Realization of an intelligent human-machine interface requires us to investigate human mechanisms and learn from them. This study focuses on communication between speech production and perception within the human brain and on realizing it in an artificial system. A physiological research study based on electromyographic signals (Honda, 1996) suggested that speech communication in the human brain might be based on a topological mapping between speech production and perception, according to an analogous topology between motor and sensory representations. Following this hypothesis, this study first investigated the topologies of the vowel system across the motor, kinematic, and acoustic spaces by means of a model simulation, and then examined the linkage between vowel production and perception in a transformed auditory feedback (TAF) experiment. The model simulation indicated that there exists an invariant mapping from muscle activations (motor space) to articulations (kinematic space) via a coordinate system consisting of force-dependent equilibrium positions, and that the mapping from the motor space to the kinematic space is unique. The motor-kinematic-acoustic deduction in the model simulation showed that the topologies were compatible from one space to another. In the TAF experiment, vowel production exhibited a compensatory response to a perturbation in the feedback sound. This implied that vowel production is controlled with reference to perceptual monitoring.
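The compensatory response observed under TAF can be illustrated with a toy negative-feedback loop. The sketch below (arbitrary gain and step count, not the authors' model) shows production drifting opposite to a formant shift introduced into the feedback:

```python
def taf_compensation(target_f1, shift_hz, gain=0.3, steps=20):
    """Toy model of compensation under transformed auditory feedback.

    The speaker hears their produced F1 shifted by `shift_hz` and nudges
    production against the perceived error on each step. Parameters are
    arbitrary; this is an illustration, not the model used in the study.
    """
    produced = target_f1
    for _ in range(steps):
        heard = produced + shift_hz   # the feedback transformation
        error = heard - target_f1     # perceived mismatch
        produced -= gain * error      # compensatory adjustment
    return produced

# With a +100 Hz feedback shift, production settles about 100 Hz below
# the target, opposing the perturbation:
print(round(taf_compensation(700.0, 100.0), 1))  # -> ~600.1
```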
Ten Oever, Sanne; Sack, Alexander T
The role of oscillatory phase in perceptual and cognitive processes is being increasingly acknowledged. To date, little is known about the direct role of phase in categorical perception. Here we show in two separate experiments that the identification of ambiguous syllables that can be perceived as either /da/ or /ga/ is biased by the underlying oscillatory phase, as measured with EEG and with sensory entrainment to rhythmic stimuli. The phase difference at which perception is biased toward /da/ versus /ga/ exactly matched the different temporal onset delays between mouth movements and speech sounds in natural audiovisual speech, which are 80 ms longer for /ga/ than for /da/. These results indicate a functional relationship between prestimulus phase and syllable identification, and suggest that the origin of this phase relationship could lie in exposure to, and subsequent learning of, unique audiovisual temporal onset differences. PMID:26668393
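To make the core analysis concrete, here is a hedged sketch (hypothetical variable names, not the authors' pipeline) of how identification responses might be binned by prestimulus oscillatory phase:

```python
import numpy as np

def response_by_phase(phases, heard_da, n_bins=8):
    """Proportion of /da/ reports as a function of prestimulus phase.

    phases: per-trial phase in radians at stimulus onset (e.g., from a
    Hilbert transform of band-passed EEG). heard_da: boolean array,
    True where the ambiguous syllable was identified as /da/.
    A sinusoidal dependence of the bin means on phase would reflect
    the bias described above. Illustrative analysis only.
    """
    edges = np.linspace(-np.pi, np.pi, n_bins + 1)
    idx = np.clip(np.digitize(phases, edges) - 1, 0, n_bins - 1)
    return np.array([heard_da[idx == k].mean() if np.any(idx == k)
                     else np.nan for k in range(n_bins)])
```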
Whistled speech is a little-studied local use of language, shaped by several cultures of the world either for distant dialogues or for rendering traditional songs. The practice consists of an emulation of the voice by means of a simple modulated pitch. It is therefore the result of a transformation of the vocal signal that implies simplifications in the frequency domain. Whistlers adapt their productions to the way each language combines the qualities of perceived height that the human ear extracts simultaneously from the complex frequency spectrum of the spoken or sung voice (pitch, timbre). As a consequence, this practice highlights key acoustic cues for the intelligibility of the languages concerned. The present study provides an analysis of the acoustic and phonetic features selected by whistled speech in several traditions, either in purely oral whistles (Spanish, Turkish, Mazatec) or in whistles produced with an instrument such as a leaf (Akha, Hmong). It underlines the convergences with the strategies of the singing ...
Clausen, Thomas Wolff
This paper presents a theoretical understanding of the effects of rhetoric, metaphors, and dehumanisation in political speeches. This theoretical framework is used to analyse selected speeches by Barack Obama, David Cameron, Donald Trump and Hillary Clinton. The analysis is carried out in order to understand how rhetoric, metaphors and dehumanisation in the analysed speeches influence the image and perception of Muslims and IS. An understanding of what effect the modern m...
Given that Chinese language learners are greatly influenced by their mother tongue, which is a tone language rather than an intonation language, learning and coping with authentic English speech seems more difficult for them than for learners of other languages. The focus of the current research is, on the basis of an analysis of the nature of spoken English and spoken Chinese, to help Chinese learners derive benefit from ICT technologies developed by the Dublin Institute of Technology (DIT). The thesis ...
José Antonio Palao Errando
In the so-called "information society," film studies have been diluted into the pragmatic and technological treatment of audiovisual discourse, just as the enjoyment of cinema itself has been caught in the web of the DVD and hypertext. Cinema itself reacts to this through complex narrative structures that distance it from standard audiovisual discourse. The function of film studies, and of their university teaching, must be the reintroduction of the rejected subject...