Audiovisual information is integrated in speech perception. One manifestation of this is the McGurk illusion in which watching the articulating face alters the auditory phonetic percept. Understanding this phenomenon fully requires a computational model with predictive power. Here, we describe...
Yeung, H Henny; Werker, Janet F
Speech is robustly audiovisual from early in infancy. Here we show that audiovisual speech perception in 4.5-month-old infants is influenced by sensorimotor information related to the lip movements they make while chewing or sucking. Experiment 1 consisted of a classic audiovisual matching procedure, in which two simultaneously displayed talking faces (visual [i] and [u]) were presented with a synchronous vowel sound (audio /i/ or /u/). Infants' looking patterns were selectively biased away from the audiovisual matching face when the infants were producing lip movements similar to those needed to produce the heard vowel. Infants' looking patterns returned to those of a baseline condition (no lip movements, looking longer at the audiovisual matching face) when they were producing lip movements that did not match the heard vowel. Experiment 2 confirmed that these sensorimotor effects interacted with the heard vowel, as looking patterns differed when infants produced these same lip movements while seeing and hearing a talking face producing an unrelated vowel (audio /a/). These findings suggest that the development of speech perception and speech production may be mutually informative.
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
investigate whether the integration of auditory and visual speech observed in these two audiovisual integration effects are specific traits of speech perception. We further ask whether audiovisual integration is undertaken in a single processing stage or multiple processing stages....
Peelle, Jonathan E; Sommers, Mitchell S
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to acoustic information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration
Peelle, Jonathan E.; Sommers, Mitchell S.
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing precision of prediction. Electrophysiological studies demonstrate oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to auditory information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported
Vroomen, Jean; Stekelenburg, Jeroen J.
Perception of intersensory temporal order is particularly difficult for (continuous) audiovisual speech, as perceivers may find it difficult to notice substantial timing differences between speech sounds and lip movements. Here we tested whether this occurs because audiovisual speech is strongly paired ("unity assumption"). Participants made…
Full Text Available A change in talker is a change in the context for the phonetic interpretation of acoustic patterns of speech. Different talkers have different mappings between acoustic patterns and phonetic categories and listeners need to adapt to these differences. Despite this complexity, listeners are adept at comprehending speech in multiple-talker contexts, albeit at a slight but measurable performance cost (e.g., slower recognition. So far, this talker-variability cost has been demonstrated only in audio-only speech. Other research in single-talker contexts have shown, however, that when listeners are able to see a talker’s face, speech recognition is improved under adverse listening (e.g., noise or distortion conditions that can increase uncertainty in the mapping between acoustic patterns and phonetic categories. Does seeing a talker's face reduce the cost of word recognition in multiple-talker contexts? We used a speeded word-monitoring task in which listeners make quick judgments about target-word recognition in single- and multiple-talker contexts. Results show faster recognition performance in single-talker conditions compared to multiple-talker conditions for both audio-only and audio-visual speech. However, recognition time in a multiple-talker context was slower in the audio-visual condition compared to audio-only condition. These results suggest that seeing a talker’s face during speech perception may slow recognition by increasing the importance of talker identification, signaling to the listener a change in talker has occurred.
Eskelund, Kasper; Dau, Torsten
Speech perception integrates signal from ear and eye. This is witnessed by a wide range of audiovisual integration effects, such as ventriloquism and the McGurk illusion. Some behavioral evidence suggest that audiovisual integration of specific aspects is special for speech perception. However, our...... knowledge of such bimodal integration would be strengthened if the phenomena could be investigated by objective, neutrally based methods. One key question of the present work is if perceptual processing of audiovisual speech can be gauged with a specific signature of neurophysiological activity...... on the auditory speech percept? In two experiments, which both combine behavioral and neurophysiological measures, an uncovering of the relation between perception of faces and of audiovisual integration is attempted. Behavioral findings suggest a strong effect of face perception, whereas the MMN results are less...
Andersen, Tobias; Tiippana, K.; Laarni, J.
Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive b......Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre...... from each of the faces and from the voice on the auditory speech percept. We found that directing visual spatial attention towards a face increased the influence of that face on auditory perception. However, the influence of the voice on auditory perception did not change suggesting that audiovisual...... integration did not change. Visual spatial attention was also able to select between the faces when lip reading. This suggests that visual spatial attention acts at the level of visual speech perception prior to audiovisual integration and that the effect propagates through audiovisual integration...
Eskelund, Kasper; Andersen, Tobias
Speech perception is audiovisual as evidenced by bimodal integration in the McGurk effect. This integration effect may be specific to speech or be applied to all stimuli in general. To investigate this, Tuomainen et al. (2005) used sine-wave speech, which naÃ¯ve observers may perceive as non......-speech, but hear as speech once informed of the linguistic origin of the signal. Combinations of sine-wave speech and incongruent video of the talker elicited a McGurk effect only for informed observers. This indicates that the audiovisual integration effect is specific to speech perception. However, observers...... that observers did look near the mouth. We conclude that eye-movements did not influence the results of Tuomainen et al. and that their results thus can be taken as evidence of a speech specific mode of audiovisual integration underlying the McGurk illusion....
Dias, James W.; Cook, Theresa C.; Rosenblum, Lawrence D.
Research suggests that selective adaptation in speech is a low-level process dependent on sensory-specific information shared between the adaptor and test-stimuli. However, previous research has only examined how adaptors shift perception of unimodal test stimuli, either auditory or visual. In the current series of experiments, we investigated whether adaptation to cross-sensory phonetic information can influence perception of integrated audio-visual phonetic information. We examined how selective adaptation to audio and visual adaptors shift perception of speech along an audiovisual test continuum. This test-continuum consisted of nine audio-/ba/-visual-/va/ stimuli, ranging in visual clarity of the mouth. When the mouth was clearly visible, perceivers “heard” the audio-visual stimulus as an integrated “va” percept 93.7% of the time (e.g., McGurk & MacDonald, 1976). As visibility of the mouth became less clear across the nine-item continuum, the audio-visual “va” percept weakened, resulting in a continuum ranging in audio-visual percepts from /va/ to /ba/. Perception of the test-stimuli was tested before and after adaptation. Changes in audiovisual speech perception were observed following adaptation to visual-/va/ and audiovisual-/va/, but not following adaptation to auditory-/va/, auditory-/ba/, or visual-/ba/. Adaptation modulates perception of integrated audio-visual speech by modulating the processing of sensory-specific information. The results suggest that auditory and visual speech information are not completely integrated at the level of selective adaptation. PMID:27041781
Lalonde, Kaylah; Holt, Rachael Frush
This study used the auditory evaluation framework [Erber (1982). Auditory Training (Alexander Graham Bell Association, Washington, DC)] to characterize the influence of visual speech on audiovisual (AV) speech perception in adults and children at multiple levels of perceptual processing. Six- to eight-year-old children and adults completed auditory and AV speech perception tasks at three levels of perceptual processing (detection, discrimination, and recognition). The tasks differed in the le...
Asakawa, Kaori; Tanaka, Akihiro; Imai, Hisato
We investigated whether audiovisual synchrony perception for speech could change after observation of the audiovisual temporal mismatch. Previous studies have revealed that audiovisual synchrony perception is re-calibrated after exposure to a constant timing difference between auditory and visual signals in non-speech. In the present study, we examined whether this audiovisual temporal recalibration occurs at the perceptual level even for speech (monosyllables). In Experiment 1, participants performed an audiovisual simultaneity judgment task (i.e., a direct measurement of the audiovisual synchrony perception) in terms of the speech signal after observation of the speech stimuli which had a constant audiovisual lag. The results showed that the “simultaneous” responses (i.e., proportion of responses for which participants judged the auditory and visual stimuli to be synchronous) at least partly depended on exposure lag. In Experiment 2, we adopted the McGurk identification task (i.e., an indirect measurement of the audiovisual synchrony perception) to exclude the possibility that this modulation of synchrony perception was solely attributable to the response strategy using stimuli identical to those of Experiment 1. The characteristics of the McGurk effect reported by participants depended on exposure lag. Thus, it was shown that audiovisual synchrony perception for speech could be modulated following exposure to constant lag both in direct and indirect measurement. Our results suggest that temporal recalibration occurs not only in non-speech signals but also in monosyllabic speech at the perceptual level.
Jaekl, Philip; Pesquita, Ana; Alsius, Agnes; Munhall, Kevin; Soto-Faraco, Salvador
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues; two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas, some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this aim, we measured word identification performance in noise using unimodal auditory stimuli, and with audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point light displays achieved via motion capture of the original talker. Point light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly towards the perception of articulatory gestures and the analysis of audiovisual speech. Copyright © 2015 Elsevier Ltd. All rights reserved.
Eskelund, Kasper; Andersen, Tobias
Speech perception is audiovisual as evidenced by the McGurk effect in which watching incongruent articulatory mouth movements can change the phonetic auditory speech percept. This type of audiovisual integration may be specific to speech or be applied to all stimuli in general. To investigate...... of audiovisual integration specific to speech perception. However, the results of Tuomainen et al. might have been influenced by another effect. When observers were naïve, they had little motivation to look at the face. When informed, they knew that the face was relevant for the task and this could increase...... visual detection task. In our first experiment, observers presented with congruent and incongruent audiovisual sine-wave speech stimuli did only show a McGurk effect when informed of the speech nature of the stimulus. Performance on the secondary visual task was very good, thus supporting the finding...
Getz, Laura M; Nordeen, Elke R; Vrabic, Sarah C; Toscano, Joseph C
Adult speech perception is generally enhanced when information is provided from multiple modalities. In contrast, infants do not appear to benefit from combining auditory and visual speech information early in development. This is true despite the fact that both modalities are important to speech comprehension even at early stages of language acquisition. How then do listeners learn how to process auditory and visual information as part of a unified signal? In the auditory domain, statistical learning processes provide an excellent mechanism for acquiring phonological categories. Is this also true for the more complex problem of acquiring audiovisual correspondences, which require the learner to integrate information from multiple modalities? In this paper, we present simulations using Gaussian mixture models (GMMs) that learn cue weights and combine cues on the basis of their distributional statistics. First, we simulate the developmental process of acquiring phonological categories from auditory and visual cues, asking whether simple statistical learning approaches are sufficient for learning multi-modal representations. Second, we use this time course information to explain audiovisual speech perception in adult perceivers, including cases where auditory and visual input are mismatched. Overall, we find that domain-general statistical learning techniques allow us to model the developmental trajectory of audiovisual cue integration in speech, and in turn, allow us to better understand the mechanisms that give rise to unified percepts based on multiple cues.
Lalonde, Kaylah; Holt, Rachael Frush
This study used the auditory evaluation framework [Erber (1982). Auditory Training (Alexander Graham Bell Association, Washington, DC)] to characterize the influence of visual speech on audiovisual (AV) speech perception in adults and children at multiple levels of perceptual processing. Six- to eight-year-old children and adults completed auditory and AV speech perception tasks at three levels of perceptual processing (detection, discrimination, and recognition). The tasks differed in the level of perceptual processing required to complete them. Adults and children demonstrated visual speech influence at all levels of perceptual processing. Whereas children demonstrated the same visual speech influence at each level of perceptual processing, adults demonstrated greater visual speech influence on tasks requiring higher levels of perceptual processing. These results support previous research demonstrating multiple mechanisms of AV speech processing (general perceptual and speech-specific mechanisms) with independent maturational time courses. The results suggest that adults rely on both general perceptual mechanisms that apply to all levels of perceptual processing and speech-specific mechanisms that apply when making phonetic decisions and/or accessing the lexicon. Six- to eight-year-old children seem to rely only on general perceptual mechanisms across levels. As expected, developmental differences in AV benefit on this and other recognition tasks likely reflect immature speech-specific mechanisms and phonetic processing in children.
Saalasti, Satu; Katsyri, Jari; Tiippana, Kaisa; Laine-Hernandez, Mari; von Wendt, Lennart; Sams, Mikko
Audiovisual speech perception was studied in adults with Asperger syndrome (AS), by utilizing the McGurk effect, in which conflicting visual articulation alters the perception of heard speech. The AS group perceived the audiovisual stimuli differently from age, sex and IQ matched controls. When a voice saying /p/ was presented with a face…
Wilson, Amanda H.; Alsius, Agnès; Parè, Martin; Munhall, Kevin G.
Purpose: The aim of this article is to examine the effects of visual image degradation on performance and gaze behavior in audiovisual and visual-only speech perception tasks. Method: We presented vowel-consonant-vowel utterances visually filtered at a range of frequencies in visual-only, audiovisual congruent, and audiovisual incongruent…
Danielson, D Kyle; Bruderer, Alison G; Kandhadai, Padmapriya; Vatikiotis-Bateson, Eric; Werker, Janet F
The period between six and 12 months is a sensitive period for language learning during which infants undergo auditory perceptual attunement, and recent results indicate that this sensitive period may exist across sensory modalities. We tested infants at three stages of perceptual attunement (six, nine, and 11 months) to determine 1) whether they were sensitive to the congruence between heard and seen speech stimuli in an unfamiliar language, and 2) whether familiarization with congruent audiovisual speech could boost subsequent non-native auditory discrimination. Infants at six- and nine-, but not 11-months, detected audiovisual congruence of non-native syllables. Familiarization to incongruent, but not congruent, audiovisual speech changed auditory discrimination at test for six-month-olds but not nine- or 11-month-olds. These results advance the proposal that speech perception is audiovisual from early in ontogeny, and that the sensitive period for audiovisual speech perception may last somewhat longer than that for auditory perception alone.
Barkhuysen, Pashiera; Krahmer, Emiel; Swerts, Marc
In this article we report on two experiments about the perception of audiovisual cues to emotional speech. The article addresses two questions: 1) how do visual cues from a speaker's face to emotion relate to auditory cues, and (2) what is the recognition speed for various facial cues to emotion? Both experiments reported below are based on tests with video clips of emotional utterances collected via a variant of the well-known Velten method. More specifically, we recorded speakers who displayed positive or negative emotions, which were congruent or incongruent with the (emotional) lexical content of the uttered sentence. In order to test this, we conducted two experiments. The first experiment is a perception experiment in which Czech participants, who do not speak Dutch, rate the perceived emotional state of Dutch speakers in a bimodal (audiovisual) or a unimodal (audio- or vision-only) condition. It was found that incongruent emotional speech leads to significantly more extreme perceived emotion scores than congruent emotional speech, where the difference between congruent and incongruent emotional speech is larger for the negative than for the positive conditions. Interestingly, the largest overall differences between congruent and incongruent emotions were found for the audio-only condition, which suggests that posing an incongruent emotion has a particularly strong effect on the spoken realization of emotions. The second experiment uses a gating paradigm to test the recognition speed for various emotional expressions from a speaker's face. In this experiment participants were presented with the same clips as experiment I, but this time presented vision-only. The clips were shown in successive segments (gates) of increasing duration. Results show that participants are surprisingly accurate in their recognition of the various emotions, as they already reach high recognition scores in the first gate (after only 160 ms). Interestingly, the recognition scores
Full Text Available Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e. a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Alsius, Agnès; Möttönen, Riikka; Sams, Mikko E; Soto-Faraco, Salvador; Tiippana, Kaisa
Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs) generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual, and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e., a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Knowland, Victoria C. P.; Mercure, Evelyne; Karmiloff-Smith, Annette; Dick, Fred; Thomas, Michael S. C.
Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly surprising given the potential utility of multiple informational cues during language…
Hisanaga, Satoko; Sekiyama, Kaoru; Igasaki, Tomohiko; Murayama, Nobuki
Several behavioural studies have shown that the interplay between voice and face information in audiovisual speech perception is not universal. Native English speakers (ESs) are influenced by visual mouth movement to a greater degree than native Japanese speakers (JSs) when listening to speech. However, the biological basis of these group differences is unknown. Here, we demonstrate the time-varying processes of group differences in terms of event-related brain potentials (ERP) and eye gaze for audiovisual and audio-only speech perception. On a behavioural level, while congruent mouth movement shortened the ESs' response time for speech perception, the opposite effect was observed in JSs. Eye-tracking data revealed a gaze bias to the mouth for the ESs but not the JSs, especially before the audio onset. Additionally, the ERP P2 amplitude indicated that ESs processed multisensory speech more efficiently than auditory-only speech; however, the JSs exhibited the opposite pattern. Taken together, the ESs' early visual attention to the mouth was likely to promote phonetic anticipation, which was not the case for the JSs. These results clearly indicate the impact of language and/or culture on multisensory speech processing, suggesting that linguistic/cultural experiences lead to the development of unique neural systems for audiovisual speech perception.
Altvater-Mackensen, Nicole; Mani, Nivedita; Grossmann, Tobias
Recent studies suggest that infants' audiovisual speech perception is influenced by articulatory experience (Mugitani et al., 2008; Yeung & Werker, 2013). The current study extends these findings by testing if infants' emerging ability to produce native sounds in babbling impacts their audiovisual speech perception. We tested 44 6-month-olds…
Venezia, Jonathan H; Thurman, Steven M; Matchin, William; George, Sahara E; Hickok, Gregory
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (~35 % identification of /apa/ compared to ~5 % in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
Venezia, Jonathan H.; Thurman, Steven M.; Matchin, William; George, Sahara E.; Hickok, Gregory
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually-relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (∼35% identification of /apa/ compared to ∼5% in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually-relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (∼130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content. PMID:26669309
The goal of this work is to find a way to measure similarity of audiovisual speech percepts. Phoneme-related self-organizing maps (SOM) with a rectangular basis are trained with data material from a (labeled) video film. For the training, a combination of auditory speech features and corresponding....... Dependent on the training data, these other units may also be contextually immediate neighboring units. The poster demonstrates the idea with text material spoken by one individual subject using a set of simple audio-visual features. The data material for the training process consists of 44 labeled...... sentences in German with a balanced phoneme repertoire. As a result it can be stated that (i) the SOM can be trained to map auditory and visual features in a topology-preserving way and (ii) they show strain due to the influence of other audio-visual units. The SOM can be used to measure similarity amongst...
Full Text Available Gender and age have been found to affect adults’ audio-visual (AV speech perception. However, research on adult aging focuses on adults over 60 years, who have an increasing likelihood for cognitive and sensory decline, which may confound positive effects of age-related AV-experience and its interaction with gender. Observed age and gender differences in AV speech perception may also depend on measurement sensitivity and AV task difficulty. Consequently both AV benefit and visual influence were used to measure visual contribution for gender-balanced groups of young (20-30 years and middle-aged adults (50-60 years with task difficulty varied using AV syllables from different talkers in alternative auditory backgrounds. Females had better speech-reading performance than males. Whereas no gender differences in AV benefit or visual influence were observed for young adults, visually influenced responses were significantly greater for middle-aged females than middle-aged males. That speech-reading performance did not influence AV benefit may be explained by visual speech extraction and AV integration constituting independent abilities. Contrastingly, the gender difference in visually influenced responses in middle adulthood may reflect an experience-related shift in females’ general AV perceptual strategy. Although young females’ speech-reading proficiency may not readily contribute to greater visual influence, between young and middle-adulthood recurrent confirmation of the contribution of visual cues induced by speech-reading proficiency may gradually shift females AV perceptual strategy towards more visually dominated responses.
Kim, Heejung; Hahm, Jarang; Lee, Hyekyoung; Kang, Eunjoo; Kang, Hyejin; Lee, Dong Soo
The human brain naturally integrates audiovisual information to improve speech perception. However, in noisy environments, understanding speech is difficult and may require much effort. Although the brain network is supposed to be engaged in speech perception, it is unclear how speech-related brain regions are connected during natural bimodal audiovisual or unimodal speech perception with counterpart irrelevant noise. To investigate the topological changes of speech-related brain networks at all possible thresholds, we used a persistent homological framework through hierarchical clustering, such as single linkage distance, to analyze the connected component of the functional network during speech perception using functional magnetic resonance imaging. For speech perception, bimodal (audio-visual speech cue) or unimodal speech cues with counterpart irrelevant noise (auditory white-noise or visual gum-chewing) were delivered to 15 subjects. In terms of positive relationship, similar connected components were observed in bimodal and unimodal speech conditions during filtration. However, during speech perception by congruent audiovisual stimuli, the tighter couplings of left anterior temporal gyrus-anterior insula component and right premotor-visual components were observed than auditory or visual speech cue conditions, respectively. Interestingly, visual speech is perceived under white noise by tight negative coupling in the left inferior frontal region-right anterior cingulate, left anterior insula, and bilateral visual regions, including right middle temporal gyrus, right fusiform components. In conclusion, the speech brain network is tightly positively or negatively connected, and can reflect efficient or effortful processes during natural audiovisual integration or lip-reading, respectively, in speech perception.
Huyse, Aurélie; Berthommier, Frédéric; Leybaert, Jacqueline
The aim of the present study was to examine audiovisual speech integration in cochlear-implanted children and in normally hearing children exposed to degraded auditory stimuli. Previous studies have shown that speech perception in cochlear-implanted users is biased toward the visual modality when audition and vision provide conflicting information. Our main question was whether an experimentally designed degradation of the visual speech cue would increase the importance of audition in the response pattern. The impact of auditory proficiency was also investigated. A group of 31 children with cochlear implants and a group of 31 normally hearing children matched for chronological age were recruited. All children with cochlear implants had profound congenital deafness and had used their implants for at least 2 years. Participants had to perform an /aCa/ consonant-identification task in which stimuli were presented randomly in three conditions: auditory only, visual only, and audiovisual (congruent and incongruent McGurk stimuli). In half of the experiment, the visual speech cue was normal; in the other half (visual reduction) a degraded visual signal was presented, aimed at preventing lipreading of good quality. The normally hearing children received a spectrally reduced speech signal (simulating the input delivered by the cochlear implant). First, performance in visual-only and in congruent audiovisual modalities were decreased, showing that the visual reduction technique used here was efficient at degrading lipreading. Second, in the incongruent audiovisual trials, visual reduction led to a major increase in the number of auditory based responses in both groups. Differences between proficient and nonproficient children were found in both groups, with nonproficient children's responses being more visual and less auditory than those of proficient children. Further analysis revealed that differences between visually clear and visually reduced conditions and between
integration to speech perception along with three model variations. In early MLE, integration is based on a continuous internal representation before categorization, which can make the model more parsimonious by imposing constraints that reflect experimental designs. The study also shows that cross......Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk−MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely......-validation can evaluate models of audiovisual integration based on typical data sets taking both goodness-of-fit and model flexibility into account. All models were tested on a published data set previously used for testing the FLMP. Cross-validation favored the early MLE while more conventional error measures...
Lewkowicz, David J; Minar, Nicholas J; Tift, Amy H; Brandon, Melissa
To investigate the developmental emergence of the perception of the multisensory coherence of native and non-native audiovisual fluent speech, we tested 4-, 8- to 10-, and 12- to 14-month-old English-learning infants. Infants first viewed two identical female faces articulating two different monologues in silence and then in the presence of an audible monologue that matched the visible articulations of one of the faces. Neither the 4-month-old nor 8- to 10-month-old infants exhibited audiovisual matching in that they did not look longer at the matching monologue. In contrast, the 12- to 14-month-old infants exhibited matching and, consistent with the emergence of perceptual expertise for the native language, perceived the multisensory coherence of native-language monologues earlier in the test trials than that of non-native language monologues. Moreover, the matching of native audible and visible speech streams observed in the 12- to 14-month-olds did not depend on audiovisual synchrony, whereas the matching of non-native audible and visible speech streams did depend on synchrony. Overall, the current findings indicate that the perception of the multisensory coherence of fluent audiovisual speech emerges late in infancy, that audiovisual synchrony cues are more important in the perception of the multisensory coherence of non-native speech than that of native audiovisual speech, and that the emergence of this skill most likely is affected by perceptual narrowing. Copyright © 2014 Elsevier Inc. All rights reserved.
van Laarhoven, Thijs; Keetels, Mirjam; Schakel, Lemmy; Vroomen, Jean
Individuals with developmental dyslexia (DD) may experience, besides reading problems, other speech-related processing deficits. Here, we examined the influence of visual articulatory information (lip-read speech) at various levels of background noise on auditory word recognition in children and adults with DD. We found that children with a…
Lewkowicz, David J.; Minar, Nicholas J.; Tift, Amy H.; Brandon, Melissa
To investigate the developmental emergence of the ability to perceive the multisensory coherence of native and non-native audiovisual fluent speech, we tested 4-, 8–10, and 12–14 month-old English-learning infants. Infants first viewed two identical female faces articulating two different monologues in silence and then in the presence of an audible monologue that matched the visible articulations of one of the faces. Neither the 4-month-old nor the 8–10 month-old infants exhibited audio-visual matching in that neither group exhibited greater looking at the matching monologue. In contrast, the 12–14 month-old infants exhibited matching and, consistent with the emergence of perceptual expertise for the native language, they perceived the multisensory coherence of native-language monologues earlier in the test trials than of non-native language monologues. Moreover, the matching of native audible and visible speech streams observed in the 12–14 month olds did not depend on audio-visual synchrony whereas the matching of non-native audible and visible speech streams did depend on synchrony. Overall, the current findings indicate that the perception of the multisensory coherence of fluent audiovisual speech emerges late in infancy, that audio-visual synchrony cues are more important in the perception of the multisensory coherence of non-native than native audiovisual speech, and that the emergence of this skill most likely is affected by perceptual narrowing. PMID:25462038
John F Magnotti
Full Text Available Audiovisual speech integration combines information from auditory speech (talker's voice and visual speech (talker's mouth movements to improve perceptual accuracy. However, if the auditory and visual speech emanate from different talkers, integration decreases accuracy. Therefore, a key step in audiovisual speech perception is deciding whether auditory and visual speech have the same source, a process known as causal inference. A well-known illusion, the McGurk Effect, consists of incongruent audiovisual syllables, such as auditory "ba" + visual "ga" (AbaVga, that are integrated to produce a fused percept ("da". This illusion raises two fundamental questions: first, given the incongruence between the auditory and visual syllables in the McGurk stimulus, why are they integrated; and second, why does the McGurk effect not occur for other, very similar syllables (e.g., AgaVba. We describe a simplified model of causal inference in multisensory speech perception (CIMS that predicts the perception of arbitrary combinations of auditory and visual speech. We applied this model to behavioral data collected from 60 subjects perceiving both McGurk and non-McGurk incongruent speech stimuli. The CIMS model successfully predicted both the audiovisual integration observed for McGurk stimuli and the lack of integration observed for non-McGurk stimuli. An identical model without causal inference failed to accurately predict perception for either form of incongruent speech. The CIMS model uses causal inference to provide a computational framework for studying how the brain performs one of its most important tasks, integrating auditory and visual speech cues to allow us to communicate with others.
Pons, Ferran; Andreu, Llorenç; Sanz-Torrent, Monica; Buil-Legaz, Lucía; Lewkowicz, David J
Speech perception involves the integration of auditory and visual articulatory information, and thus requires the perception of temporal synchrony between this information. There is evidence that children with specific language impairment (SLI) have difficulty with auditory speech perception but it is not known if this is also true for the integration of auditory and visual speech. Twenty Spanish-speaking children with SLI, twenty typically developing age-matched Spanish-speaking children, and twenty Spanish-speaking children matched for MLU-w participated in an eye-tracking study to investigate the perception of audiovisual speech synchrony. Results revealed that children with typical language development perceived an audiovisual asynchrony of 666 ms regardless of whether the auditory or visual speech attribute led the other one. Children with SLI only detected the 666 ms asynchrony when the auditory component preceded [corrected] the visual component. None of the groups perceived an audiovisual asynchrony of 366 ms. These results suggest that the difficulty of speech processing by children with SLI would also involve difficulties in integrating auditory and visual aspects of speech perception.
Full Text Available Previous studies have found that perception in older people benefits from multisensory over uni-sensory information. As normal speech recognition is affected by both the auditory input and the visual lip-movements of the speaker, we investigated the efficiency of audio and visual integration in an older population by manipulating the relative reliability of the auditory and visual information in speech. We also investigated the role of the semantic context of the sentence to assess whether audio-visual integration is affected by top-down semantic processing. We presented participants with audio-visual sentences in which the visual component was either blurred or not blurred. We found that there was a greater cost in recall performance for semantically meaningless speech in the audio-visual blur compared to audio-visual no blur condition and this effect was specific to the older group. Our findings have implications for understanding how aging affects efficient multisensory integration for the perception of speech and suggests that multisensory inputs may benefit speech perception in older adults when the semantic content of the speech is unpredictable.
Buchan, Julie N; Paré, Martin; Munhall, Kevin G
During face-to-face conversation the face provides auditory and visual linguistic information, and also conveys information about the identity of the speaker. This study investigated behavioral strategies involved in gathering visual information while watching talking faces. The effects of varying talker identity and varying the intelligibility of speech (by adding acoustic noise) on gaze behavior were measured with an eyetracker. Varying the intelligibility of the speech by adding noise had a noticeable effect on the location and duration of fixations. When noise was present subjects adopted a vantage point that was more centralized on the face by reducing the frequency of the fixations on the eyes and mouth and lengthening the duration of their gaze fixations on the nose and mouth. Varying talker identity resulted in a more modest change in gaze behavior that was modulated by the intelligibility of the speech. Although subjects generally used similar strategies to extract visual information in both talker variability conditions, when noise was absent there were more fixations on the mouth when viewing a different talker every trial as opposed to the same talker every trial. These findings provide a useful baseline for studies examining gaze behavior during audiovisual speech perception and perception of dynamic faces.
Baart, Martijn; Lindborg, Alma; Andersen, Tobias S
Incongruent audiovisual speech stimuli can lead to perceptual illusions such as fusions or combinations. Here, we investigated the underlying audiovisual integration process by measuring ERPs. We observed that visual speech-induced suppression of P2 amplitude (which is generally taken as a measure of audiovisual integration) for fusions was similar to suppression obtained with fully congruent stimuli, whereas P2 suppression for combinations was larger. We argue that these effects arise because the phonetic incongruency is solved differently for both types of stimuli. © 2017 The Authors. European Journal of Neuroscience published by Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Baart, Martijn; Lindborg, Alma Cornelia; Andersen, Tobias S
Incongruent audiovisual speech stimuli can lead to perceptual illusions such as fusions or combinations. Here, we investigated the underlying audiovisual integration process by measuring ERPs. We observed that visual speech-induced suppression of P2 amplitude (which is generally taken as a measure...... of audiovisual integration) for fusions was comparable to suppression obtained with fully congruent stimuli, whereas P2 suppression for combinations was larger. We argue that these effects arise because the phonetic incongruency is solved differently for both types of stimuli. This article is protected...
Bernstein, Lynne E; Auer, Edward T; Eberhardt, Silvio P; Jiang, Jintao
Speech perception under audiovisual (AV) conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called "reverse hierarchy theory" of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning.
Kumar, G Vinodh; Halder, Tamesh; Jaiswal, Amit K; Mukherjee, Abhishek; Roy, Dipanjan; Banerjee, Arpan
Observable lip movements of the speaker influence perception of auditory speech. A classical example of this influence is reported by listeners who perceive an illusory (cross-modal) speech sound (McGurk-effect) when presented with incongruent audio-visual (AV) speech stimuli. Recent neuroimaging studies of AV speech perception accentuate the role of frontal, parietal, and the integrative brain sites in the vicinity of the superior temporal sulcus (STS) for multisensory speech perception. However, if and how does the network across the whole brain participates during multisensory perception processing remains an open question. We posit that a large-scale functional connectivity among the neural population situated in distributed brain sites may provide valuable insights involved in processing and fusing of AV speech. Varying the psychophysical parameters in tandem with electroencephalogram (EEG) recordings, we exploited the trial-by-trial perceptual variability of incongruent audio-visual (AV) speech stimuli to identify the characteristics of the large-scale cortical network that facilitates multisensory perception during synchronous and asynchronous AV speech. We evaluated the spectral landscape of EEG signals during multisensory speech perception at varying AV lags. Functional connectivity dynamics for all sensor pairs was computed using the time-frequency global coherence, the vector sum of pairwise coherence changes over time. During synchronous AV speech, we observed enhanced global gamma-band coherence and decreased alpha and beta-band coherence underlying cross-modal (illusory) perception compared to unisensory perception around a temporal window of 300-600 ms following onset of stimuli. During asynchronous speech stimuli, a global broadband coherence was observed during cross-modal perception at earlier times along with pre-stimulus decreases of lower frequency power, e.g., alpha rhythms for positive AV lags and theta rhythms for negative AV lags. Thus, our
Sarah H Baum
Full Text Available Older adults exhibit decreased performance and increased trial-to-trial variability on a range of cognitive tasks, including speech perception. We used blood oxygen level dependent functional magnetic resonance imaging (BOLD fMRI to search for neural correlates of these behavioral phenomena. We compared brain responses to simple speech stimuli (audiovisual syllables in 24 healthy older adults (53 to 70 years old and 14 younger adults (23 to 39 years old using two independent analysis strategies: region-of-interest (ROI and voxel-wise whole-brain analysis. While mean response amplitudes were moderately greater in younger adults, older adults had much greater within-subject variability. The greatly increased variability in older adults was observed for both individual voxels in the whole-brain analysis and for ROIs in the left superior temporal sulcus, the left auditory cortex, and the left visual cortex. Increased variability in older adults could not be attributed to differences in head movements between the groups. Increased neural variability may be related to the performance declines and increased behavioral variability that occur with aging.
Williams, Joshua T; Darcy, Isabelle; Newman, Sharlene D
The aim of the present study was to characterize effects of learning a sign language on the processing of a spoken language. Specifically, audiovisual phoneme comprehension was assessed before and after 13 weeks of sign language exposure. L2 ASL learners performed this task in the fMRI scanner. Results indicated that L2 American Sign Language (ASL) learners' behavioral classification of the speech sounds improved with time compared to hearing nonsigners. Results indicated increased activation in the supramarginal gyrus (SMG) after sign language exposure, which suggests concomitant increased phonological processing of speech. A multiple regression analysis indicated that learner's rating on co-sign speech use and lipreading ability was correlated with SMG activation. This pattern of results indicates that the increased use of mouthing and possibly lipreading during sign language acquisition may concurrently improve audiovisual speech processing in budding hearing bimodal bilinguals. Copyright © 2015 Elsevier B.V. All rights reserved.
Full Text Available (1 To evaluate the recognition of words, phonemes and lexical tones in audiovisual (AV and auditory-only (AO modes in Mandarin-speaking adults with cochlear implants (CIs; (2 to understand the effect of presentation levels on AV speech perception; (3 to learn the effect of hearing experience on AV speech perception.Thirteen deaf adults (age = 29.1±13.5 years; 8 male, 5 female who had used CIs for >6 months and 10 normal-hearing (NH adults participated in this study. Seven of them were prelingually deaf, and 6 postlingually deaf. The Mandarin Monosyllablic Word Recognition Test was used to assess recognition of words, phonemes and lexical tones in AV and AO conditions at 3 presentation levels: speech detection threshold (SDT, speech recognition threshold (SRT and 10 dB SL (re:SRT.The prelingual group had better phoneme recognition in the AV mode than in the AO mode at SDT and SRT (both p = 0.016, and so did the NH group at SDT (p = 0.004. Mode difference was not noted in the postlingual group. None of the groups had significantly different tone recognition in the 2 modes. The prelingual and postlingual groups had significantly better phoneme and tone recognition than the NH one at SDT in the AO mode (p = 0.016 and p = 0.002 for phonemes; p = 0.001 and p<0.001 for tones but were outperformed by the NH group at 10 dB SL (re:SRT in both modes (both p<0.001 for phonemes; p<0.001 and p = 0.002 for tones. The recognition scores had a significant correlation with group with age and sex controlled (p<0.001.Visual input may help prelingually deaf implantees to recognize phonemes but may not augment Mandarin tone recognition. The effect of presentation level seems minimal on CI users' AV perception. This indicates special considerations in developing audiological assessment protocols and rehabilitation strategies for implantees who speak tonal languages.
Stevenson, Ryan A.; Siemann, Justin K.; Woynaroski, Tiffany G.; Schneider, Brittany C.; Eberly, Haley E.; Camarata, Stephen M.; Wallace, Mark T.
Atypical communicative abilities are a core marker of Autism Spectrum Disorders (ASD). A number of studies have shown that, in addition to auditory comprehension differences, individuals with autism frequently show atypical responses to audiovisual speech, suggesting a multisensory contribution to these communicative differences from their…
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias S
Speech perception integrates auditory and visual information. This is evidenced by the McGurk illusion where seeing the talking face influences the auditory phonetic percept and by the audiovisual detection advantage where seeing the talking face influences the detectability of the acoustic speech signal. Here, we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers informed of its speech-like nature recognize as speech. While the McGurk illusion only occurred for informed observers, the audiovisual detection advantage occurred for naïve observers as well. This finding supports a multistage account of audiovisual integration of speech in which the many attributes of the audiovisual speech signal are integrated by separate integration processes.
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
Seeing the talker’s articulatory mouth movements can influence the auditory speech percept both in speech identification and detection tasks. Here we show that these audiovisual integration effects also occur for sine wave speech (SWS), which is an impoverished speech signal that naïve observers...... often fail to perceive as speech. While audiovisual integration in the identification task only occurred when observers were informed of the speech-like nature of SWS, integration occurred in the detection task both for informed and naïve observers. This shows that both speech-specific and general...... mechanisms underlie audiovisual integration of speech....
Stevenson, Ryan A; Nelms, Caitlin E; Baum, Sarah H; Zurkovsky, Lilia; Barense, Morgan D; Newhouse, Paul A; Wallace, Mark T
Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio. For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as signal-to-noise ratio decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals. Copyright © 2015 Elsevier Inc. All rights reserved.
Leybaert, Jacqueline; Macchi, Lucie; Huyse, Aurélie; Champoux, François; Bayard, Clémence; Colin, Cécile; Berthommier, Frédéric
Audiovisual speech perception of children with specific language impairment (SLI) and children with typical language development (TLD) was compared in two experiments using /aCa/ syllables presented in the context of a masking release paradigm. Children had to repeat syllables presented in auditory alone, visual alone (speechreading), audiovisual congruent and incongruent (McGurk) conditions. Stimuli were masked by either stationary (ST) or amplitude modulated (AM) noise. Although children with SLI were less accurate in auditory and audiovisual speech perception, they showed similar auditory masking release effect than children with TLD. Children with SLI also had less correct responses in speechreading than children with TLD, indicating impairment in phonemic processing of visual speech information. In response to McGurk stimuli, children with TLD showed more fusions in AM noise than in ST noise, a consequence of the auditory masking release effect and of the influence of visual information. Children with SLI did not show this effect systematically, suggesting they were less influenced by visual speech. However, when the visual cues were easily identified, the profile of responses to McGurk stimuli was similar in both groups, suggesting that children with SLI do not suffer from an impairment of audiovisual integration. An analysis of percent of information transmitted revealed a deficit in the children with SLI, particularly for the place of articulation feature. Taken together, the data support the hypothesis of an intact peripheral processing of auditory speech information, coupled with a supra modal deficit of phonemic categorization in children with SLI. Clinical implications are discussed.
Yamamoto, Ryosuke; Naito, Yasushi; Tona, Risa; Moroto, Saburo; Tamaya, Rinko; Fujiwara, Keizo; Shinohara, Shogo; Takebayashi, Shinji; Kikuchi, Masahiro; Michida, Tetsuhiko
An effect of audio-visual (AV) integration is observed when the auditory and visual stimuli are incongruent (the McGurk effect). In general, AV integration is helpful especially in subjects wearing hearing aids or cochlear implants (CIs). However, the influence of AV integration on spoken word recognition in individuals with bilateral CIs (Bi-CIs) has not been fully investigated so far. In this study, we investigated AV integration in children with Bi-CIs. The study sample included thirty one prelingually deafened children who underwent sequential bilateral cochlear implantation. We assessed their responses to congruent and incongruent AV stimuli with three CI-listening modes: only the 1st CI, only the 2nd CI, and Bi-CIs. The responses were assessed in the whole group as well as in two sub-groups: a proficient group (syllable intelligibility ≥80% with the 1st CI) and a non-proficient group (syllable intelligibility effect in each of the three CI-listening modes. AV integration responses were observed in a subset of incongruent AV stimuli, and the patterns observed with the 1st CI and with Bi-CIs were similar. In the proficient group, the responses with the 2nd CI were not significantly different from those with the 1st CI whereas in the non-proficient group the responses with the 2nd CI were driven by visual stimuli more than those with the 1st CI. Our results suggested that prelingually deafened Japanese children who underwent sequential bilateral cochlear implantation exhibit AV integration abilities, both in monaural listening as well as in binaural listening. We also observed a higher influence of visual stimuli on speech perception with the 2nd CI in the non-proficient group, suggesting that Bi-CIs listeners with poorer speech recognition rely on visual information more compared to the proficient subjects to compensate for poorer auditory input. Nevertheless, poorer quality auditory input with the 2nd CI did not interfere with AV integration with binaural
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
Speech perception integrates auditory and visual information. This is evidenced by the McGurk illusion where seeing the talking face influences the auditory phonetic percept and by the audiovisual detection advantage where seeing the talking face influences the detectability of the acoustic speech...... signal. Here we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers...... informed of its speech-like nature recognize as speech. While the McGurk illusion only occurred for informed observers the audiovisual detection advantage occurred for naïve observers as well. This finding supports a multi-stage account of audiovisual integration of speech in which the many attributes...
Noppeney, Uta; Lee, Hwee Ling
To form a coherent percept of the environment, the brain must integrate sensory signals emanating from a common source but segregate those from different sources. Temporal regularities are prominent cues for multisensory integration, particularly for speech and music perception. In line with models of predictive coding, we suggest that the brain adapts an internal model to the statistical regularities in its environment. This internal model enables cross-sensory and sensorimotor temporal predictions as a mechanism to arbitrate between integration and segregation of signals from different senses. © 2018 New York Academy of Sciences.
Full Text Available Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain where it can guide the selection of appropriate actions. To simplify this process, it's been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both area of the mouth opening and the voice envelope are temporally modulated in the 2-7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
Sheffert, Sonya M; Olson, Elizabeth
In this research, we investigated the effects of voice and face information on the perceptual learning of talkers and on long-term memory for spoken words. In the first phase, listeners were trained over several days to identify voices from words presented auditorily or audiovisually. The training data showed that visual information about speakers enhanced voice learning, revealing cross-modal connections in talker processing akin to those observed in speech processing. In the second phase, the listeners completed an auditory or audiovisual word recognition memory test in which equal numbers of words were spoken by familiar and unfamiliar talkers. The data showed that words presented by familiar talkers were more likely to be retrieved from episodic memory, regardless of modality. Together, these findings provide new information about the representational code underlying familiar talker recognition and the role of stimulus familiarity in episodic word recognition.
Andersen, Tobias S.; Starrfelt, Randi
Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speech perception. While these studies have focused on auditory speech perception other studies have shown that Broca's area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca's aphasia did not experience the McGurk illusion suggesting that an intact Broca's area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical, which could be due to Broca's area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke's aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing that are not specific to Broca's aphasia. PMID:25972819
John F Magnotti
Full Text Available During speech perception, humans integrate auditory information from the voice with visual information from the face. This multisensory integration increases perceptual precision, but only if the two cues come from the same talker; this requirement has been largely ignored by current models of speech perception. We describe a generative model of multisensory speech perception that includes this critical step of determining the likelihood that the voice and face information have a common cause. A key feature of the model is that it is based on a principled analysis of how an observer should solve this causal inference problem using the asynchrony between two cues and the reliability of the cues. This allows the model to make predictions abut the behavior of subjects performing a synchrony judgment task, predictive power that does not exist in other approaches, such as post hoc fitting of Gaussian curves to behavioral data. We tested the model predictions against the performance of 37 subjects performing a synchrony judgment task viewing audiovisual speech under a variety of manipulations, including varying asynchronies, intelligibility, and visual cue reliability. The causal inference model outperformed the Gaussian model across two experiments, providing a better fit to the behavioral data with fewer parameters. Because the causal inference model is derived from a principled understanding of the task, model parameters are directly interpretable in terms of stimulus and subject properties.
Christopher W Bishop
Full Text Available Speech is the most important form of human communication but ambient sounds and competing talkers often degrade its acoustics. Fortunately the brain can use visual information, especially its highly precise spatial information, to improve speech comprehension in noisy environments. Previous studies have demonstrated that audiovisual integration depends strongly on spatiotemporal factors. However, some integrative phenomena such as McGurk interference persist even with gross spatial disparities, suggesting that spatial alignment is not necessary for robust integration of audiovisual place-of-articulation cues. It is therefore unclear how speech-cues interact with audiovisual spatial integration mechanisms. Here, we combine two well established psychophysical phenomena, the McGurk effect and the ventriloquist's illusion, to explore this dependency. Our results demonstrate that conflicting spatial cues may not interfere with audiovisual integration of speech, but conflicting speech-cues can impede integration in space. This suggests a direct but asymmetrical influence between ventral 'what' and dorsal 'where' pathways.
Full Text Available Speech is a means of communication which is intrinsically bimodal: the audio signal originates from the dynamics of the articulators. This paper reviews recent works in the field of audiovisual speech, and more specifically techniques developed to measure the level of correspondence between audio and visual speech. It overviews the most common audio and visual speech front-end processing, transformations performed on audio, visual, or joint audiovisual feature spaces, and the actual measure of correspondence between audio and visual speech. Finally, the use of synchrony measure for biometric identity verification based on talking faces is experimented on the BANCA database.
Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta
Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…
Petridis, Stavros; Pantic, Maja
Past research on automatic laughter detection has focused mainly on audio-based detection. Here we present an audiovisual approach to distinguishing laughter from speech and we show that integrating the information from audio and video leads to an improved reliability of audiovisual approach in
Megnin-Viggars, Odette; Goswami, Usha
Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and…
Full Text Available Although infant speech perception in often studied in isolated modalities, infants' experience with speech is largely multimodal (i.e., speech sounds they hear are accompanied by articulating faces. Across two experiments, we tested infants' sensitivity to the relationship between the auditory and visual components of audiovisual speech in their native (English and non-native (Spanish language. In Experiment 1, infants' looking times were measured during a preferential looking task in which they saw two simultaneous visual speech streams articulating a story, one in English and the other in Spanish, while they heard either the English or the Spanish version of the story. In Experiment 2, looking times from another group of infants were measured as they watched single displays of congruent and incongruent combinations of English and Spanish audio and visual speech streams. Findings demonstrated an age-related increase in looking towards the native relative to non-native visual speech stream when accompanied by the corresponding (native auditory speech. This increase in native language preference did not appear to be driven by a difference in preference for native vs. non-native audiovisual congruence as we observed no difference in looking times at the audiovisual streams in Experiment 2.
Traditionally, second language (L2) instruction has emphasised auditory-based instruction methods. However, this approach is restrictive in the sense that speech perception by humans is not just an auditory phenomenon but a multimodal one, and specifically, a visual one as well. In the past decade, experimental studies have shown that the…
Andersen, Tobias; Starrfelt, Randi
perception. While these studies have focused on auditory speech perception other studies have shown that Broca's area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca's aphasia did not experience the McGurk illusion suggesting that an intact Broca......Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speech......'s area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical...
D'Souza, Dean; D'Souza, Hana; Johnson, Mark H; Karmiloff-Smith, Annette
Typically-developing (TD) infants can construct unified cross-modal percepts, such as a speaking face, by integrating auditory-visual (AV) information. This skill is a key building block upon which higher-level skills, such as word learning, are built. Because word learning is seriously delayed in most children with neurodevelopmental disorders, we assessed the hypothesis that this delay partly results from a deficit in integrating AV speech cues. AV speech integration has rarely been investigated in neurodevelopmental disorders, and never previously in infants. We probed for the McGurk effect, which occurs when the auditory component of one sound (/ba/) is paired with the visual component of another sound (/ga/), leading to the perception of an illusory third sound (/da/ or /tha/). We measured AV integration in 95 infants/toddlers with Down, fragile X, or Williams syndrome, whom we matched on Chronological and Mental Age to 25 TD infants. We also assessed a more basic AV perceptual ability: sensitivity to matching vs. mismatching AV speech stimuli. Infants with Williams syndrome failed to demonstrate a McGurk effect, indicating poor AV speech integration. Moreover, while the TD children discriminated between matching and mismatching AV stimuli, none of the other groups did, hinting at a basic deficit or delay in AV speech processing, which is likely to constrain subsequent language development. Copyright © 2016 Elsevier Inc. All rights reserved.
Kaganovich, Natalya; Schumaker, Jennifer
Previous studies have demonstrated that the presence of visual speech cues reduces the amplitude and latency of the N1 and P2 event-related potential (ERP) components elicited by speech stimuli. However, the developmental trajectory of this effect is not yet fully mapped. We examined ERP responses to auditory, visual, and audiovisual speech in two groups of school-age children (7–8-year-olds and 10–11-year-olds) and in adults. Audiovisual speech led to the attenuation of the N1 and P2 components in all groups of participants, suggesting that the neural mechanisms underlying these effects are functional by early school years. Additionally, while the reduction in N1 was largest over the right scalp, the P2 attenuation was largest over the left and midline scalp. The difference in the hemispheric distribution of the N1 and P2 attenuation supports the idea that these components index at least somewhat disparate neural processes within the context of audiovisual speech perception. PMID:25463815
Richards, Michael D; Goltz, Herbert C; Wong, Agnes M F
Amblyopia is a common developmental sensory disorder that has been extensively and systematically investigated as a unisensory visual impairment. However, its effects are increasingly recognized to extend beyond vision to the multisensory domain. Indeed, amblyopia is associated with altered cross-modal interactions in audiovisual temporal perception, audiovisual spatial perception, and audiovisual speech perception. Furthermore, although the visual impairment in amblyopia is typically unilateral, the multisensory abnormalities tend to persist even when viewing with both eyes. Knowledge of the extent and mechanisms of the audiovisual impairments in amblyopia, however, remains in its infancy. This work aims to review our current understanding of audiovisual processing and integration deficits in amblyopia, and considers the possible mechanisms underlying these abnormalities. Copyright © 2018. Published by Elsevier Ltd.
Maidment, David W; Kang, Hi Jee; Stewart, Hannah J; Amitay, Sygal
The study explored whether visual information improves speech identification in typically developing children with normal hearing when the auditory signal is spectrally degraded. Children (n=69) and adults (n=15) were presented with noise-vocoded sentences from the Children's Co-ordinate Response Measure (Rosen, 2011) in auditory-only or audiovisual conditions. The number of bands was adaptively varied to modulate the degradation of the auditory signal, with the number of bands required for approximately 79% correct identification calculated as the threshold. The youngest children (4- to 5-year-olds) did not benefit from accompanying visual information, in comparison to 6- to 11-year-old children and adults. Audiovisual gain also increased with age in the child sample. The current data suggest that children younger than 6 years of age do not fully utilize visual speech cues to enhance speech perception when the auditory signal is degraded. This evidence not only has implications for understanding the development of speech perception skills in children with normal hearing but may also inform the development of new treatment and intervention strategies that aim to remediate speech perception difficulties in pediatric cochlear implant users.
Van der Burg, Erik; Goodbourn, Patrick T
The brain is adaptive. The speed of propagation through air, and of low-level sensory processing, differs markedly between auditory and visual stimuli; yet the brain can adapt to compensate for the resulting cross-modal delays. Studies investigating temporal recalibration to audiovisual speech have used prolonged adaptation procedures, suggesting that adaptation is sluggish. Here, we show that adaptation to asynchronous audiovisual speech occurs rapidly. Participants viewed a brief clip of an actor pronouncing a single syllable. The voice was either advanced or delayed relative to the corresponding lip movements, and participants were asked to make a synchrony judgement. Although we did not use an explicit adaptation procedure, we demonstrate rapid recalibration based on a single audiovisual event. We find that the point of subjective simultaneity on each trial is highly contingent upon the modality order of the preceding trial. We find compelling evidence that rapid recalibration generalizes across different stimuli, and different actors. Finally, we demonstrate that rapid recalibration occurs even when auditory and visual events clearly belong to different actors. These results suggest that rapid temporal recalibration to audiovisual speech is primarily mediated by basic temporal factors, rather than higher-order factors such as perceived simultaneity and source identity. © 2015 The Author(s) Published by the Royal Society. All rights reserved.
Alsius, Agnès; Navarra, Jordi; Campbell, Ruth; Soto-Faraco, Salvador
One of the most commonly cited examples of human multisensory integration occurs during exposure to natural speech, when the vocal and the visual aspects of the signal are integrated in a unitary percept. Audiovisual association of facial gestures and vocal sounds has been demonstrated in nonhuman primates and in prelinguistic children, arguing for a general basis for this capacity. One critical question, however, concerns the role of attention in such multisensory integration. Although both behavioral and neurophysiological studies have converged on a preattentive conceptualization of audiovisual speech integration, this mechanism has rarely been measured under conditions of high attentional load, when the observers' attention resources are depleted. We tested the extent to which audiovisual integration was modulated by the amount of available attentional resources by measuring the observers' susceptibility to the classic McGurk illusion in a dual-task paradigm. The proportion of visually influenced responses was severely, and selectively, reduced if participants were concurrently performing an unrelated visual or auditory task. In contrast with the assumption that crossmodal speech integration is automatic, our results suggest that these multisensory binding processes are subject to attentional demands.
Barrós-Loscertales, Alfonso; Ventura-Campos, Noelia; Visser, Maya; Alsius, Agnès; Pallier, Christophe; Avila Rivera, César; Soto-Faraco, Salvador
Neuroimaging studies of audiovisual speech processing have exclusively addressed listeners' native language (L1). Yet, several behavioural studies now show that AV processing plays an important role in non-native (L2) speech perception. The current fMRI study measured brain activity during auditory, visual, audiovisual congruent and audiovisual incongruent utterances in L1 and L2. BOLD responses to congruent AV speech in the pSTS were stronger than in either unimodal condition in both L1 and L2. Yet no differences in AV processing were expressed according to the language background in this area. Instead, the regions in the bilateral occipital lobe had a stronger congruency effect on the BOLD response (congruent higher than incongruent) in L2 as compared to L1. According to these results, language background differences are predominantly expressed in these unimodal regions, whereas the pSTS is similarly involved in AV integration regardless of language dominance. Copyright © 2013 Elsevier Inc. All rights reserved.
Tobias Søren Andersen
Full Text Available Lesions to Broca’s area cause aphasia characterised by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca’s area is also involved in speech perception. While these studies have focused on auditory speech perception other studies have shown that Broca’s area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca’s aphasia did not experience the McGurk illusion suggesting that an intact Broca’s area is necessary for audiovisual integration of speech. Here we describe a patient with Broca’s aphasia who experienced the McGurk illusion. This indicates that an intact Broca’s area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical, which could be due to Broca’s area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke’s aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing that are not specific to Broca’s aphasia.
Altieri, Nicholas; Wenger, Michael J
Speech perception engages both auditory and visual modalities. Limitations of traditional accuracy-only approaches in the investigation of audiovisual speech perception have motivated the use of new methodologies. In an audiovisual speech identification task, we utilized capacity (Townsend and Nozawa, 1995), a dynamic measure of efficiency, to quantify audiovisual integration. Capacity was used to compare RT distributions from audiovisual trials to RT distributions from auditory-only and visual-only trials across three listening conditions: clear auditory signal, S/N ratio of -12 dB, and S/N ratio of -18 dB. The purpose was to obtain EEG recordings in conjunction with capacity to investigate how a late ERP co-varies with integration efficiency. Results showed efficient audiovisual integration for low auditory S/N ratios, but inefficient audiovisual integration when the auditory signal was clear. The ERP analyses showed evidence for greater audiovisual amplitude compared to the unisensory signals for lower auditory S/N ratios (higher capacity/efficiency) compared to the high S/N ratio (low capacity/inefficient integration). The data are consistent with an interactive framework of integration, where auditory recognition is influenced by speech-reading as a function of signal clarity.
Baart, Martijn; Stekelenburg, Jeroen J; Vroomen, Jean
Lip-read speech is integrated with heard speech at various neural levels. Here, we investigated the extent to which lip-read induced modulations of the auditory N1 and P2 (measured with EEG) are indicative of speech-specific audiovisual integration, and we explored to what extent the ERPs were modulated by phonetic audiovisual congruency. In order to disentangle speech-specific (phonetic) integration from non-speech integration, we used Sine-Wave Speech (SWS) that was perceived as speech by half of the participants (they were in speech-mode), while the other half was in non-speech mode. Results showed that the N1 obtained with audiovisual stimuli peaked earlier than the N1 evoked by auditory-only stimuli. This lip-read induced speeding up of the N1 occurred for listeners in speech and non-speech mode. In contrast, if listeners were in speech-mode, lip-read speech also modulated the auditory P2, but not if listeners were in non-speech mode, thus revealing speech-specific audiovisual binding. Comparing ERPs for phonetically congruent audiovisual stimuli with ERPs for incongruent stimuli revealed an effect of phonetic stimulus congruency that started at ~200 ms after (in)congruence became apparent. Critically, akin to the P2 suppression, congruency effects were only observed if listeners were in speech mode, and not if they were in non-speech mode. Using identical stimuli, we thus confirm that audiovisual binding involves (partially) different neural mechanisms for sound processing in speech and non-speech mode. © 2013 Published by Elsevier Ltd.
Pilling, Michael; Thomas, Sharon
Two experiments investigate the effectiveness of audiovisual (AV) speech cues (cues derived from both seeing and hearing a talker speak) in facilitating perceptual learning of spectrally distorted speech. Speech was distorted through an eight channel noise-vocoder which shifted the spectral envelope of the speech signal to simulate the properties…
Ozker, Muge; Schepers, Inga M; Magnotti, John F; Yoshor, Daniel; Beauchamp, Michael S
Human speech can be comprehended using only auditory information from the talker's voice. However, comprehension is improved if the talker's face is visible, especially if the auditory information is degraded as occurs in noisy environments or with hearing loss. We explored the neural substrates of audiovisual speech perception using electrocorticography, direct recording of neural activity using electrodes implanted on the cortical surface. We observed a double dissociation in the responses to audiovisual speech with clear and noisy auditory component within the superior temporal gyrus (STG), a region long known to be important for speech perception. Anterior STG showed greater neural activity to audiovisual speech with clear auditory component, whereas posterior STG showed similar or greater neural activity to audiovisual speech in which the speech was replaced with speech-like noise. A distinct border between the two response patterns was observed, demarcated by a landmark corresponding to the posterior margin of Heschl's gyrus. To further investigate the computational roles of both regions, we considered Bayesian models of multisensory integration, which predict that combining the independent sources of information available from different modalities should reduce variability in the neural responses. We tested this prediction by measuring the variability of the neural responses to single audiovisual words. Posterior STG showed smaller variability than anterior STG during presentation of audiovisual speech with noisy auditory component. Taken together, these results suggest that posterior STG but not anterior STG is important for multisensory integration of noisy auditory and visual speech.
Full Text Available The effect of stimulation history on the perception of a current event can yield two opposite effects, namely: adaptation or hysteresis. The perception of the current event thus goes in the opposite or in the same direction as prior stimulation, respectively. In audiovisual (AV synchrony perception, adaptation effects have primarily been reported. Here, we tested if perceptual hysteresis could also be observed over adaptation in AV timing perception by varying different experimental conditions. Participants were asked to judge the synchrony of the last (test stimulus of an AV sequence with either constant or gradually changing AV intervals (constant and dynamic condition, respectively. The onset timing of the test stimulus could be cued or not (prospective vs. retrospective condition, respectively. We observed hysteretic effects for AV synchrony judgments in the retrospective condition that were independent of the constant or dynamic nature of the adapted stimuli; these effects disappeared in the prospective condition. The present findings suggest that knowing when to estimate a stimulus property has a crucial impact on perceptual simultaneity judgments. Our results extend beyond AV timing perception, and have strong implications regarding the comparative study of hysteresis and adaptation phenomena.
Baart, M.; Stekelenburg, J.J.; Vroomen, J.
Lip-read speech is integrated with heard speech at various neural levels. Here, we investigated the extent to which lip-read induced modulations of the auditory N1 and P2 (measured with EEG) are indicative of speech-specific audiovisual integration, and we explored to what extent the ERPs were
Treille, Avril; Vilain, Coriandre; Kandel, Sonia; Sato, Marc
Previous electrophysiological studies have provided strong evidence for early multisensory integrative mechanisms during audiovisual speech perception. From these studies, one unanswered issue is whether hearing our own voice and seeing our own articulatory gestures facilitate speech perception, possibly through a better processing and integration of sensory inputs with our own sensory-motor knowledge. The present EEG study examined the impact of self-knowledge during the perception of auditory (A), visual (V) and audiovisual (AV) speech stimuli that were previously recorded from the participant or from a speaker he/she had never met. Audiovisual interactions were estimated by comparing N1 and P2 auditory evoked potentials during the bimodal condition (AV) with the sum of those observed in the unimodal conditions (A + V). In line with previous EEG studies, our results revealed an amplitude decrease of P2 auditory evoked potentials in AV compared to A + V conditions. Crucially, a temporal facilitation of N1 responses was observed during the visual perception of self speech movements compared to those of another speaker. This facilitation was negatively correlated with the saliency of visual stimuli. These results provide evidence for a temporal facilitation of the integration of auditory and visual speech signals when the visual situation involves our own speech gestures.
Mendi, Engin; Bayrak, Coskun
Learning disabilities affect the ability of children to learn, despite their having normal intelligence. Assistive tools can highly increase functional capabilities of children with learning disorders such as writing, reading, or listening. In this article, we describe a text-to-audiovisual synthesizer that can serve as an assistive tool for such children. The system automatically converts an input text to audiovisual speech, providing synchronization of the head, eye, and lip movements of the three-dimensional face model with appropriate facial expressions and word flow of the text. The proposed system can enhance speech perception and help children having learning deficits to improve their chances of success.
Jeanne A Guiraud
Full Text Available The language difficulties often seen in individuals with autism might stem from an inability to integrate audiovisual information, a skill important for language development. We investigated whether 9-month-old siblings of older children with autism, who are at an increased risk of developing autism, are able to integrate audiovisual speech cues. We used an eye-tracker to record where infants looked when shown a screen displaying two faces of the same model, where one face is articulating/ba/and the other/ga/, with one face congruent with the syllable sound being presented simultaneously, the other face incongruent. This method was successful in showing that infants at low risk can integrate audiovisual speech: they looked for the same amount of time at the mouths in both the fusible visual/ga/- audio/ba/and the congruent visual/ba/- audio/ba/displays, indicating that the auditory and visual streams fuse into a McGurk-type of syllabic percept in the incongruent condition. It also showed that low-risk infants could perceive a mismatch between auditory and visual cues: they looked longer at the mouth in the mismatched, non-fusible visual/ba/- audio/ga/display compared with the congruent visual/ga/- audio/ga/display, demonstrating that they perceive an uncommon, and therefore interesting, speech-like percept when looking at the incongruent mouth (repeated ANOVA: displays x fusion/mismatch conditions interaction: F(1,16 = 17.153, p = 0.001. The looking behaviour of high-risk infants did not differ according to the type of display, suggesting difficulties in matching auditory and visual information (repeated ANOVA, displays x conditions interaction: F(1,25 = 0.09, p = 0.767, in contrast to low-risk infants (repeated ANOVA: displays x conditions x low/high-risk groups interaction: F(1,41 = 4.466, p = 0.041. In some cases this reduced ability might lead to the poor communication skills characteristic of autism.
Full Text Available Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Typically, the visual mode of the synthetic speech is synthesized separately from the audio, the latter being either natural or synthesized speech. However, the perception of mismatches between these two information streams requires experimental exploration since it could degrade the quality of the output. In order to increase the intermodal coherence in synthetic 2D photorealistic speech, we extended the well-known unit selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video. Subjective experiments confirm that the audiovisual signals created by our multimodal synthesis strategy are indeed perceived as being more synchronous than those of systems in which both modes are not intrinsically coherent. Furthermore, it is shown that the degree of coherence between the auditory mode and the visual mode has an influence on the perceived quality of the synthetic visual speech fragment. In addition, the audio quality was found to have only a minor influence on the perceived visual signal's quality.
Roa Romero, Yadira; Senkowski, Daniel; Keil, Julian
The McGurk illusion is a prominent example of audiovisual speech perception and the influence that visual stimuli can have on auditory perception. In this illusion, a visual speech stimulus influences the perception of an incongruent auditory stimulus, resulting in a fused novel percept. In this high-density electroencephalography (EEG) study, we were interested in the neural signatures of the subjective percept of the McGurk illusion as a phenomenon of speech-specific multisensory integration. Therefore, we examined the role of cortical oscillations and event-related responses in the perception of congruent and incongruent audiovisual speech. We compared the cortical activity elicited by objectively congruent syllables with incongruent audiovisual stimuli. Importantly, the latter elicited a subjectively congruent percept: the McGurk illusion. We found that early event-related responses (N1) to audiovisual stimuli were reduced during the perception of the McGurk illusion compared with congruent stimuli. Most interestingly, our study showed a stronger poststimulus suppression of beta-band power (13-30 Hz) at short (0-500 ms) and long (500-800 ms) latencies during the perception of the McGurk illusion compared with congruent stimuli. Our study demonstrates that auditory perception is influenced by visual context and that the subsequent formation of a McGurk illusion requires stronger audiovisual integration even at early processing stages. Our results provide evidence that beta-band suppression at early stages reflects stronger stimulus processing in the McGurk illusion. Moreover, stronger late beta-band suppression in McGurk illusion indicates the resolution of incongruent physical audiovisual input and the formation of a coherent, illusory multisensory percept. Copyright © 2015 the American Physiological Society.
Lebib, Riadh; Papo, David; Douiri, Abdel; de Bode, Stella; Gillon Dowens, Margaret; Baudonnière, Pierre-Marie
Lipreading reliably improve speech perception during face-to-face conversation. Within the range of good dubbing, however, adults tolerate some audiovisual (AV) discrepancies and lipreading, then, can give rise to confusion. We used event-related brain potentials (ERPs) to study the perceptual strategies governing the intermodal processing of dynamic and bimodal speech stimuli, either congruently dubbed or not. Electrophysiological analyses revealed that non-coherent audiovisual dubbings modulated in amplitude an endogenous ERP component, the N300, we compared to a 'N400-like effect' reflecting the difficulty to integrate these conflicting pieces of information. This result adds further support for the existence of a cerebral system underlying 'integrative processes' lato sensu. Further studies should take advantage of this 'N400-like effect' with AV speech stimuli to open new perspectives in the domain of psycholinguistics.
Hashimoto, Masahiro; Kumashiro, Masaharu
The purpose of this study was to investigate the limitations of lip-reading advantages for Japanese young adults by desynchronizing visual and auditory information in speech. In the experiment, audio-visual speech stimuli were presented under the six test conditions: audio-alone, and audio-visually with either 0, 60, 120, 240 or 480 ms of audio delay. The stimuli were the video recordings of a face of a female Japanese speaking long and short Japanese sentences. The intelligibility of the audio-visual stimuli was measured as a function of audio delays in sixteen untrained young subjects. Speech intelligibility under the audio-delay condition of less than 120 ms was significantly better than that under the audio-alone condition. On the other hand, the delay of 120 ms corresponded to the mean mora duration measured for the audio stimuli. The results implied that audio delays of up to 120 ms would not disrupt lip-reading advantage, because visual and auditory information in speech seemed to be integrated on a syllabic time scale. Potential applications of this research include noisy workplace in which a worker must extract relevant speech from all the other competing noises.
Shahin, Antoine J; Shen, Stanley; Kerlin, Jess R
We examined the relationship between tolerance for audiovisual onset asynchrony (AVOA) and the spectrotemporal fidelity of the spoken words and the speaker's mouth movements. In two experiments that only varied in the temporal order of sensory modality, visual speech leading (exp1) or lagging (exp2) acoustic speech, participants watched intact and blurred videos of a speaker uttering trisyllabic words and nonwords that were noise vocoded with 4-, 8-, 16-, and 32-channels. They judged whether the speaker's mouth movements and the speech sounds were in-sync or out-of-sync . Individuals perceived synchrony (tolerated AVOA) on more trials when the acoustic speech was more speech-like (8 channels and higher vs. 4 channels), and when visual speech was intact than blurred (exp1 only). These findings suggest that enhanced spectrotemporal fidelity of the audiovisual (AV) signal prompts the brain to widen the window of integration promoting the fusion of temporally distant AV percepts.
Erickson, Laura C; Heeg, Elizabeth; Rauschecker, Josef P; Turkeltaub, Peter E
The brain improves speech processing through the integration of audiovisual (AV) signals. Situations involving AV speech integration may be crudely dichotomized into those where auditory and visual inputs contain (1) equivalent, complementary signals (validating AV speech) or (2) inconsistent, different signals (conflicting AV speech). This simple framework may allow the systematic examination of broad commonalities and differences between AV neural processes engaged by various experimental paradigms frequently used to study AV speech integration. We conducted an activation likelihood estimation metaanalysis of 22 functional imaging studies comprising 33 experiments, 311 subjects, and 347 foci examining "conflicting" versus "validating" AV speech. Experimental paradigms included content congruency, timing synchrony, and perceptual measures, such as the McGurk effect or synchrony judgments, across AV speech stimulus types (sublexical to sentence). Colocalization of conflicting AV speech experiments revealed consistency across at least two contrast types (e.g., synchrony and congruency) in a network of dorsal stream regions in the frontal, parietal, and temporal lobes. There was consistency across all contrast types (synchrony, congruency, and percept) in the bilateral posterior superior/middle temporal cortex. Although fewer studies were available, validating AV speech experiments were localized to other regions, such as ventral stream visual areas in the occipital and inferior temporal cortex. These results suggest that while equivalent, complementary AV speech signals may evoke activity in regions related to the corroboration of sensory input, conflicting AV speech signals recruit widespread dorsal stream areas likely involved in the resolution of conflicting sensory signals. Copyright © 2014 Wiley Periodicals, Inc.
Full Text Available An increasing number of neuroscience papers capitalize on the assumption published in this journal that visual speech would be typically 150 ms ahead of auditory speech. It happens that the estimation of audiovisual asynchrony in the reference paper is valid only in very specific cases, for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call "preparatory gestures". However, when syllables are chained in sequences, as they are typically in most parts of a natural speech utterance, asynchrony should be defined in a different way. This is what we call "comodulatory gestures" providing auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
Hwee Ling eLee
Full Text Available This psychophysics study used musicians as a model to investigate whether musical expertise shapes the temporal integration window for audiovisual speech, sinewave speech or music. Musicians and non-musicians judged the audiovisual synchrony of speech, sinewave analogues of speech, and music stimuli at 13 audiovisual stimulus onset asynchronies (±360, ±300 ±240, ±180, ±120, ±60, and 0 ms. Further, we manipulated the duration of the stimuli by presenting sentences/melodies or syllables/tones. Critically, musicians relative to non-musicians exhibited significantly narrower temporal integration windows for both music and sinewave speech. Further, the temporal integration window for music decreased with the amount of music practice, but not with age of acquisition. In other words, the more musicians practiced piano in the past three years, the more sensitive they became to the temporal misalignment of visual and auditory signals. Collectively, our findings demonstrate that music practicing fine-tunes the audiovisual temporal integration window to various extents depending on the stimulus class. While the effect of piano practicing was most pronounced for music, it also generalized to other stimulus classes such as sinewave speech and to a marginally significant degree to natural speech.
Hickok, Gregory; Rogalsky, Corianne; Matchin, William; Basilakos, Alexandra; Cai, Julia; Pillay, Sara; Ferrill, Michelle; Mickelsen, Soren; Anderson, Steven W; Love, Tracy; Binder, Jeffrey; Fridriksson, Julius
Auditory and visual speech information are often strongly integrated resulting in perceptual enhancements for audiovisual (AV) speech over audio alone and sometimes yielding compelling illusory fusion percepts when AV cues are mismatched, the McGurk-MacDonald effect. Previous research has identified three candidate regions thought to be critical for AV speech integration: the posterior superior temporal sulcus (STS), early auditory cortex, and the posterior inferior frontal gyrus. We assess the causal involvement of these regions (and others) in the first large-scale (N = 100) lesion-based study of AV speech integration. Two primary findings emerged. First, behavioral performance and lesion maps for AV enhancement and illusory fusion measures indicate that classic metrics of AV speech integration are not necessarily measuring the same process. Second, lesions involving superior temporal auditory, lateral occipital visual, and multisensory zones in the STS are the most disruptive to AV speech integration. Further, when AV speech integration fails, the nature of the failure-auditory vs visual capture-can be predicted from the location of the lesions. These findings show that AV speech processing is supported by unimodal auditory and visual cortices as well as multimodal regions such as the STS at their boundary. Motor related frontal regions do not appear to play a role in AV speech integration. Copyright © 2018 Elsevier Ltd. All rights reserved.
Full Text Available Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency by which others’ emotions are recognized. But how and when exactly do the different modalities interact? One aspect in multisensory perception that has received increasing interest in recent years is the concept of crossmodal prediction. In emotion perception, as in most other settings, visual information precedes the auditory one. Thereby, leading in visual information can facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, it has not been addressed so far in audiovisual emotion perception. Based on the current state of the art in (a crossmodal prediction and (b multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG and magnetoencephalographic (MEG studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow for a more reliable prediction of auditory information compared to non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 response in the EEG and the duration of visual emotional but not non-emotional information. If the assumption that emotional content allows for more reliable predictions can be corroborated in future studies, crossmodal prediction is a crucial factor in our understanding of multisensory emotion perception.
Jessen, Sarah; Kotz, Sonja A
Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency by which others' emotions are recognized. But how and when exactly do the different modalities interact? One aspect in multisensory perception that has received increasing interest in recent years is the concept of cross-modal prediction. In emotion perception, as in most other settings, visual information precedes the auditory information. Thereby, leading in visual information can facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, so far it has not been addressed in audiovisual emotion perception. Based on the current state of the art in (a) cross-modal prediction and (b) multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG) and magnetoencephalographic (MEG) studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow more reliable predicting of auditory information compared to non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 EEG response and the duration of visual emotional, but not non-emotional information. If the assumption that emotional content allows more reliable predicting can be corroborated in future studies, cross-modal prediction is a crucial factor in our understanding of multisensory emotion perception.
Altieri, Nicholas; Pisoni, David B.; Townsend, James T.
Summerfield (1987) proposed several accounts of audiovisual speech perception, a field of research that has burgeoned in recent years. The proposed accounts included the integration of discrete phonetic features, vectors describing the values of independent acoustical and optical parameters, the filter function of the vocal tract, and articulatory dynamics of the vocal tract. The latter two accounts assume that the representations of audiovisual speech perception are based on abstract gestures, while the former two assume that the representations consist of symbolic or featural information obtained from visual and auditory modalities. Recent converging evidence from several different disciplines reveals that the general framework of Summerfield’s feature-based theories should be expanded. An updated framework building upon the feature-based theories is presented. We propose a processing model arguing that auditory and visual brain circuits provide facilitatory information when the inputs are correctly timed, and that auditory and visual speech representations do not necessarily undergo translation into a common code during information processing. Future research on multisensory processing in speech perception should investigate the connections between auditory and visual brain regions, and utilize dynamic modeling tools to further understand the timing and information processing mechanisms involved in audiovisual speech integration. PMID:21968081
Andersen, Tobias; Starrfelt, Randi
's area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical......, which could be due to Broca's area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke's aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing...
Full Text Available Events encoded in separate sensory modalities, such as audition and vision, can seem to be synchronous across a relatively broad range of physical timing differences. This may suggest that the precision of audio-visual timing judgments is inherently poor. Here we show that this is not necessarily true. We contrast timing sensitivity for isolated streams of audio and visual speech, and for streams of audio and visual speech accompanied by additional, temporally offset, visual speech streams. We find that the precision with which synchronous streams of audio and visual speech are identified is enhanced by the presence of additional streams of asynchronous visual speech. Our data suggest that timing perception is shaped by selective grouping processes, which can result in enhanced precision in temporally cluttered environments. The imprecision suggested by previous studies might therefore be a consequence of examining isolated pairs of audio and visual events. We argue that when an isolated pair of cross-modal events is presented, they tend to group perceptually and to seem synchronous as a consequence. We have revealed greater precision by providing multiple visual signals, possibly allowing a single auditory speech stream to group selectively with the most synchronous visual candidate. The grouping processes we have identified might be important in daily life, such as when we attempt to follow a conversation in a crowded room.
Vatakis, Argiro; Maragos, Petros; Rodomagoulakis, Isidoros; Spence, Charles
We investigated how the physical differences associated with the articulation of speech affect the temporal aspects of audiovisual speech perception. Video clips of consonants and vowels uttered by three different speakers were presented. The video clips were analyzed using an auditory-visual signal saliency model in order to compare signal saliency and behavioral data. Participants made temporal order judgments (TOJs) regarding which speech-stream (auditory or visual) had been presented first. The sensitivity of participants' TOJs and the point of subjective simultaneity (PSS) were analyzed as a function of the place, manner of articulation, and voicing for consonants, and the height/backness of the tongue and lip-roundedness for vowels. We expected that in the case of the place of articulation and roundedness, where the visual-speech signal is more salient, temporal perception of speech would be modulated by the visual-speech signal. No such effect was expected for the manner of articulation or height. The results demonstrate that for place and manner of articulation, participants' temporal percept was affected (although not always significantly) by highly-salient speech-signals with the visual-signals requiring smaller visual-leads at the PSS. This was not the case when height was evaluated. These findings suggest that in the case of audiovisual speech perception, a highly salient visual-speech signal may lead to higher probabilities regarding the identity of the auditory-signal that modulate the temporal window of multisensory integration of the speech-stimulus. PMID:23060756
Başkent, Deniz; Gaudrain, Etienne
Evidence for transfer of musical training to better perception of speech in noise has been mixed. Unlike speech-in-noise, speech-on-speech perception utilizes many of the skills that musical training improves, such as better pitch perception and stream segregation, as well as use of higher-level
Evitts, Paul M.; Portugal, Lindsay; Van Dine, Ami; Holler, Aline
Background: There is minimal research on the contribution of visual information on speech intelligibility for individuals with a laryngectomy (IWL). Aims: The purpose of this project was to determine the effects of mode of presentation (audio-only, audio-visual) on alaryngeal speech intelligibility. Method: Twenty-three naive listeners were…
Kuling, I.A.; Kohlrausch, A.G.; Juola, J.F.
The integration of visual and auditory inputs in the human brain works properly only if the components are perceived in close temporal proximity. In the present study, we quantified cross-modal interactions in the human brain for audiovisual stimuli with temporal asynchronies, using a paradigm from
Baum, Sarah H; Martin, Randi C; Hamilton, A Cris; Beauchamp, Michael S
Converging evidence suggests that the left superior temporal sulcus (STS) is a critical site for multisensory integration of auditory and visual information during speech perception. We report a patient, SJ, who suffered a stroke that damaged the left tempo-parietal area, resulting in mild anomic aphasia. Structural MRI showed complete destruction of the left middle and posterior STS, as well as damage to adjacent areas in the temporal and parietal lobes. Surprisingly, SJ demonstrated preserved multisensory integration measured with two independent tests. First, she perceived the McGurk effect, an illusion that requires integration of auditory and visual speech. Second, her perception of morphed audiovisual speech with ambiguous auditory or visual information was significantly influenced by the opposing modality. To understand the neural basis for this preserved multisensory integration, blood-oxygen level dependent functional magnetic resonance imaging (BOLD fMRI) was used to examine brain responses to audiovisual speech in SJ and 23 healthy age-matched controls. In controls, bilateral STS activity was observed. In SJ, no activity was observed in the damaged left STS but in the right STS, more cortex was active in SJ than in any of the normal controls. Further, the amplitude of the BOLD response in right STS response to McGurk stimuli was significantly greater in SJ than in controls. The simplest explanation of these results is a reorganization of SJ's cortical language networks such that the right STS now subserves multisensory integration of speech. Copyright © 2012 Elsevier Inc. All rights reserved.
Kushnerenko, Elena; Tomalski, Przemyslaw; Ballieux, Haiko; Ribeiro, Helena; Potton, Anita; Axelsson, Emma L; Murphy, Elizabeth; Moore, Derek G
Research on audiovisual speech integration has reported high levels of individual variability, especially among young infants. In the present study we tested the hypothesis that this variability results from individual differences in the maturation of audiovisual speech processing during infancy. A developmental shift in selective attention to audiovisual speech has been demonstrated between 6 and 9 months with an increase in the time spent looking to articulating mouths as compared to eyes (Lewkowicz & Hansen-Tift. (2012) Proc. Natl Acad. Sci. USA, 109, 1431-1436; Tomalski et al. (2012) Eur. J. Dev. Psychol., 1-14). In the present study we tested whether these changes in behavioural maturational level are associated with differences in brain responses to audiovisual speech across this age range. We measured high-density event-related potentials (ERPs) in response to videos of audiovisually matching and mismatched syllables /ba/ and /ga/, and subsequently examined visual scanning of the same stimuli with eye-tracking. There were no clear age-specific changes in ERPs, but the amplitude of audiovisual mismatch response (AVMMR) to the combination of visual /ba/ and auditory /ga/ was strongly negatively associated with looking time to the mouth in the same condition. These results have significant implications for our understanding of individual differences in neural signatures of audiovisual speech processing in infants, suggesting that they are not strictly related to chronological age but instead associated with the maturation of looking behaviour, and develop at individual rates in the second half of the first year of life. © 2013 Federation of European Neuroscience Societies and John Wiley & Sons Ltd.
Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Rosenblum, Lawrence D.
Speech perception is inherently multimodal. Visual speech (lip-reading) information is used by all perceivers and readily integrates with auditory speech. Imaging research suggests that the brain treats auditory and visual speech similarly. These findings have led some researchers to consider that speech perception works by extracting amodal information that takes the same form across modalities. From this perspective, speech integration is a property of the input information itself. Amodal s...
Van Bree, K.C.
For many speech telecommunication technologies a robust speech activity detector is important. An audio-only speech detector will givefalse positives when the interfering signal is speech or has speech characteristics. The modality video is suitable to solve this problem. In this report the approach
Richards, Michael D.; Goltz, Herbert C.; Wong, Agnes M. F.
Amblyopia is a developmental visual impairment that is increasingly recognized to affect higher-level perceptual and multisensory processes. To further investigate the audiovisual (AV) perceptual impairments associated with this condition, we characterized the temporal interval in which asynchronous auditory and visual stimuli are perceived as simultaneous 50% of the time (i.e., the AV simultaneity window). Adults with unilateral amblyopia (n = 17) and visually normal controls (n = 17) judged...
Full Text Available In blind people, the visual channel cannot assist face-to-face communication via lipreading or visual prosody. Nevertheless, the visual system may enhance the evaluation of auditory information due to its cross-links to (1 the auditory system, (2 supramodal representations, and (3 frontal action-related areas. Apart from feedback or top-down support of, for example, the processing of spatial or phonological representations, experimental data have shown that the visual system can impact auditory perception at more basic computational stages such as temporal resolution. For example, blind as compared to sighted subjects are more resistant against backward masking, and this ability appears to be associated with activity in visual cortex. Regarding the comprehension of continuous speech, blind subjects can learn to use accelerated text-to-speech systems for "reading" texts at ultra-fast speaking rates (> 16 syllables/s, exceeding by far the normal range of 6 syllables/s. An fMRI study has shown that this ability, among other brain regions, significantly covaries with BOLD responses in bilateral pulvinar, right visual cortex, and left supplementary motor area. Furthermore, magnetoencephalographic (MEG measurements revealed a particular component in right occipital cortex phase-locked to the syllable onsets of accelerated speech. In sighted people, the "bottleneck" for understanding time-compressed speech seems related to a demand for buffering phonological material and is, presumably, linked to frontal brain structures. On the other hand, the neurophysiological correlates of functions overcoming this bottleneck, seem to depend upon early visual cortex activity. The present Hypothesis and Theory paper outlines a model that aims at binding these data together, based on early cross-modal pathways that are already known from various audiovisual experiments considering cross-modal adjustments in space, time, and object recognition.
Richards, Michael D; Goltz, Herbert C; Wong, Agnes M F
Amblyopia is a developmental visual impairment that is increasingly recognized to affect higher-level perceptual and multisensory processes. To further investigate the audiovisual (AV) perceptual impairments associated with this condition, we characterized the temporal interval in which asynchronous auditory and visual stimuli are perceived as simultaneous 50% of the time (i.e., the AV simultaneity window). Adults with unilateral amblyopia (n = 17) and visually normal controls (n = 17) judged the simultaneity of a flash and a click presented with both eyes viewing. The signal onset asynchrony (SOA) varied from 0 ms to 450 ms for auditory-lead and visual-lead conditions. A subset of participants with amblyopia (n = 6) was tested monocularly. Compared to the control group, the auditory-lead side of the AV simultaneity window was widened by 48 ms (36%; p = 0.002), whereas that of the visual-lead side was widened by 86 ms (37%; p = 0.02). The overall mean window width was 500 ms, compared to 366 ms among controls (37% wider; p = 0.002). Among participants with amblyopia, the simultaneity window parameters were unchanged by viewing condition, but subgroup analysis revealed differential effects on the parameters by amblyopia severity, etiology, and foveal suppression status. Possible mechanisms to explain these findings include visual temporal uncertainty, interocular perceptual latency asymmetry, and disruption of normal developmental tuning of sensitivity to audiovisual asynchrony.
Michael D Richards
Full Text Available Amblyopia is a developmental visual impairment that is increasingly recognized to affect higher-level perceptual and multisensory processes. To further investigate the audiovisual (AV perceptual impairments associated with this condition, we characterized the temporal interval in which asynchronous auditory and visual stimuli are perceived as simultaneous 50% of the time (i.e., the AV simultaneity window. Adults with unilateral amblyopia (n = 17 and visually normal controls (n = 17 judged the simultaneity of a flash and a click presented with both eyes viewing. The signal onset asynchrony (SOA varied from 0 ms to 450 ms for auditory-lead and visual-lead conditions. A subset of participants with amblyopia (n = 6 was tested monocularly. Compared to the control group, the auditory-lead side of the AV simultaneity window was widened by 48 ms (36%; p = 0.002, whereas that of the visual-lead side was widened by 86 ms (37%; p = 0.02. The overall mean window width was 500 ms, compared to 366 ms among controls (37% wider; p = 0.002. Among participants with amblyopia, the simultaneity window parameters were unchanged by viewing condition, but subgroup analysis revealed differential effects on the parameters by amblyopia severity, etiology, and foveal suppression status. Possible mechanisms to explain these findings include visual temporal uncertainty, interocular perceptual latency asymmetry, and disruption of normal developmental tuning of sensitivity to audiovisual asynchrony.
Lidestam, Björn; Rönnberg, Jerker
The present study compared elderly hearing aid (EHA) users (n = 20) with elderly normal-hearing (ENH) listeners (n = 20) in terms of isolation points (IPs, the shortest time required for correct identification of a speech stimulus) and accuracy of audiovisual gated speech stimuli (consonants, words, and final words in highly and less predictable sentences) presented in silence. In addition, we compared the IPs of audiovisual speech stimuli from the present study with auditory ones extracted from a previous study, to determine the impact of the addition of visual cues. Both participant groups achieved ceiling levels in terms of accuracy in the audiovisual identification of gated speech stimuli; however, the EHA group needed longer IPs for the audiovisual identification of consonants and words. The benefit of adding visual cues to auditory speech stimuli was more evident in the EHA group, as audiovisual presentation significantly shortened the IPs for consonants, words, and final words in less predictable sentences; in the ENH group, audiovisual presentation only shortened the IPs for consonants and words. In conclusion, although the audiovisual benefit was greater for EHA group, this group had inferior performance compared with the ENH group in terms of IPs when supportive semantic context was lacking. Consequently, EHA users needed the initial part of the audiovisual speech signal to be longer than did their counterparts with normal hearing to reach the same level of accuracy in the absence of a semantic context. PMID:27317667
Kronschnabel, Jens; Brem, Silvia; Maurer, Urs; Brandeis, Daniel
The classical phonological deficit account of dyslexia is increasingly linked to impairments in grapho-phonological conversion, and to dysfunctions in superior temporal regions associated with audiovisual integration. The present study investigates mechanisms of audiovisual integration in typical and impaired readers at the critical developmental stage of adolescence. Congruent and incongruent audiovisual as well as unimodal (visual only and auditory only) material was presented. Audiovisual presentations were single letters and three-letter (consonant-vowel-consonant) stimuli accompanied by matching or mismatching speech sounds. Three-letter stimuli exhibited fast phonetic transitions as in real-life language processing and reading. Congruency effects, i.e. different brain responses to congruent and incongruent stimuli were taken as an indicator of audiovisual integration at a phonetic level (grapho-phonological conversion). Comparisons of unimodal and audiovisual stimuli revealed basic, more sensory aspects of audiovisual integration. By means of these two criteria of audiovisual integration, the generalizability of audiovisual deficits in dyslexia was tested. Moreover, it was expected that the more naturalistic three-letter stimuli are superior to single letters in revealing group differences. Electrophysiological and hemodynamic (EEG and fMRI) data were acquired simultaneously in a simple target detection task. Applying the same statistical models to event-related EEG potentials and fMRI responses allowed comparing the effects detected by the two techniques at a descriptive level. Group differences in congruency effects (congruent against incongruent) were observed in regions involved in grapho-phonological processing, including the left inferior frontal and angular gyri and the inferotemporal cortex. Importantly, such differences also emerged in superior temporal key regions. Three-letter stimuli revealed stronger group differences than single letters. No
Moradi, Shahram; Lidestam, Björn; Rönnberg, Jerker
The present study compared elderly hearing aid (EHA) users (n = 20) with elderly normal-hearing (ENH) listeners (n = 20) in terms of isolation points (IPs, the shortest time required for correct identification of a speech stimulus) and accuracy of audiovisual gated speech stimuli (consonants, words, and final words in highly and less predictable sentences) presented in silence. In addition, we compared the IPs of audiovisual speech stimuli from the present study with auditory ones extracted from a previous study, to determine the impact of the addition of visual cues. Both participant groups achieved ceiling levels in terms of accuracy in the audiovisual identification of gated speech stimuli; however, the EHA group needed longer IPs for the audiovisual identification of consonants and words. The benefit of adding visual cues to auditory speech stimuli was more evident in the EHA group, as audiovisual presentation significantly shortened the IPs for consonants, words, and final words in less predictable sentences; in the ENH group, audiovisual presentation only shortened the IPs for consonants and words. In conclusion, although the audiovisual benefit was greater for EHA group, this group had inferior performance compared with the ENH group in terms of IPs when supportive semantic context was lacking. Consequently, EHA users needed the initial part of the audiovisual speech signal to be longer than did their counterparts with normal hearing to reach the same level of accuracy in the absence of a semantic context. © The Author(s) 2016.
Nie, Yingjiu; Galvin, John J; Morikawa, Michael; André, Victoria; Wheeler, Harley; Fu, Qian-Jie
This study examined music and speech perception in normal-hearing children with some or no musical training. Thirty children (mean age = 11.3 years), 15 with and 15 without formal music training participated in the study. Music perception was measured using a melodic contour identification (MCI) task; stimuli were a piano sample or sung speech with a fixed timbre (same word for each note) or a mixed timbre (different words for each note). Speech perception was measured in quiet and in steady noise using a matrix-styled sentence recognition task; stimuli were naturally intonated speech or sung speech with a fixed pitch (same note for each word) or a mixed pitch (different notes for each word). Significant musician advantages were observed for MCI and speech in noise but not for speech in quiet. MCI performance was significantly poorer with the mixed timbre stimuli. Speech performance in noise was significantly poorer with the fixed or mixed pitch stimuli than with spoken speech. Across all subjects, age at testing and MCI performance were significantly correlated with speech performance in noise. MCI and speech performance in quiet was significantly poorer for children than for adults from a related study using the same stimuli and tasks; speech performance in noise was significantly poorer for young than for older children. Long-term music training appeared to benefit melodic pitch perception and speech understanding in noise in these pediatric listeners.
Petridis, Stavros; Pantic, Maja
Past research on automatic laughter classification/detection has focused mainly on audio-based approaches. Here we present an audiovisual approach to distinguishing laughter from speech, and we show that integrating the information from audio and video channels may lead to improved performance over
Full Text Available Audio-visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi-stream Dynamic Bayesian Network and coupled HMM are widely used for audio-visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN to perform unsupervised extraction of spatial-temporal multimodal features from Tibetan audio-visual speech data and build an accurate audio-visual speech recognition model under a no frame-independency assumption. The experiment results on Tibetan speech data from some real-world environments showed the proposed DDBN outperforms the state-of-art methods in word recognition accuracy.
Moradi, Shahram; Lidestam, Björn; Rönnberg, Jerker
This study investigated the degree to which audiovisual presentation (compared to auditory-only presentation) affected isolation point (IPs, the amount of time required for the correct identification of speech stimuli using a gating paradigm) in silence and noise conditions. The study expanded on the findings of Moradi et al. (under revision), using the same stimuli, but presented in an audiovisual instead of an auditory-only manner. The results showed that noise impeded the identification of consonants and words (i.e., delayed IPs and lowered accuracy), but not the identification of final words in sentences. In comparison with the previous study by Moradi et al., it can be concluded that the provision of visual cues expedited IPs and increased the accuracy of speech stimuli identification in both silence and noise. The implication of the results is discussed in terms of models for speech understanding. PMID:23801980
Valkenier, Bea; Duyne, Jurriaan Y.; Andringa, Tjeerd C.; Baskent, Deniz
Purpose: Auditory perception of vowels in background noise is enhanced when combined with visually perceived speech features. The objective of this study was to investigate whether the influence of visual cues on vowel perception extends to incongruent vowels, in a manner similar to the McGurk effect observed with consonants. Method:…
Petar S. Aleksic
Full Text Available We describe an audio-visual automatic continuous speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system utilizes facial animation parameters (FAPs supported by the MPEG-4 standard for the visual representation of speech. We also describe a robust and automatic algorithm we have developed to extract FAPs from visual data, which does not require hand labeling or extensive training procedures. The principal component analysis (PCA was performed on the FAPs in order to decrease the dimensionality of the visual feature vectors, and the derived projection weights were used as visual features in the audio-visual automatic speech recognition (ASR experiments. Both single-stream and multistream hidden Markov models (HMMs were used to model the ASR system, integrate audio and visual information, and perform a relatively large vocabulary (approximately 1000 words speech recognition experiments. The experiments performed use clean audio data and audio data corrupted by stationary white Gaussian noise at various SNRs. The proposed system reduces the word error rate (WER by 20% to 23% relatively to audio-only speech recognition WERs, at various SNRs (0Ã¢Â€Â“30 dB with additive white Gaussian noise, and by 19% relatively to audio-only speech recognition WER under clean audio conditions.
Petridis, Stavros; Asghar, Ali; Pantic, Maja
In this study, a system that discriminates laughter from speech by modelling the relationship between audio and visual features is presented. The underlying assumption is that this relationship is different between speech and laughter. Neural networks are trained which learn the audio-to-visual and
Schormans, Ashley L; Scott, Kaela E; Vo, Albert M Q; Tyker, Anna; Typlt, Marei; Stolzberg, Daniel; Allman, Brian L
Extensive research on humans has improved our understanding of how the brain integrates information from our different senses, and has begun to uncover the brain regions and large-scale neural activity that contributes to an observer's ability to perceive the relative timing of auditory and visual stimuli. In the present study, we developed the first behavioral tasks to assess the perception of audiovisual temporal synchrony in rats. Modeled after the parameters used in human studies, separate groups of rats were trained to perform: (1) a simultaneity judgment task in which they reported whether audiovisual stimuli at various stimulus onset asynchronies (SOAs) were presented simultaneously or not; and (2) a temporal order judgment task in which they reported whether they perceived the auditory or visual stimulus to have been presented first. Furthermore, using in vivo electrophysiological recordings in the lateral extrastriate visual (V2L) cortex of anesthetized rats, we performed the first investigation of how neurons in the rat multisensory cortex integrate audiovisual stimuli presented at different SOAs. As predicted, rats ( n = 7) trained to perform the simultaneity judgment task could accurately (~80%) identify synchronous vs. asynchronous (200 ms SOA) trials. Moreover, the rats judged trials at 10 ms SOA to be synchronous, whereas the majority (~70%) of trials at 100 ms SOA were perceived to be asynchronous. During the temporal order judgment task, rats ( n = 7) perceived the synchronous audiovisual stimuli to be "visual first" for ~52% of the trials, and calculation of the smallest timing interval between the auditory and visual stimuli that could be detected in each rat (i.e., the just noticeable difference (JND)) ranged from 77 ms to 122 ms. Neurons in the rat V2L cortex were sensitive to the timing of audiovisual stimuli, such that spiking activity was greatest during trials when the visual stimulus preceded the auditory by 20-40 ms. Ultimately, given
Ye, Zheng; Rüsseler, Jascha; Gerth, Ivonne; Münte, Thomas F
Dyslexia is an impairment of reading and spelling that affects both children and adults even after many years of schooling. Dyslexic readers have deficits in the integration of auditory and visual inputs but the neural mechanisms of the deficits are still unclear. This fMRI study examined the neural processing of auditorily presented German numbers 0-9 and videos of lip movements of a German native speaker voicing numbers 0-9 in unimodal (auditory or visual) and bimodal (always congruent) conditions in dyslexic readers and their matched fluent readers. We confirmed results of previous studies that the superior temporal gyrus/sulcus plays a critical role in audiovisual speech integration: fluent readers showed greater superior temporal activations for combined audiovisual stimuli than auditory-/visual-only stimuli. Importantly, such an enhancement effect was absent in dyslexic readers. Moreover, the auditory network (bilateral superior temporal regions plus medial PFC) was dynamically modulated during audiovisual integration in fluent, but not in dyslexic readers. These results suggest that superior temporal dysfunction may underly poor audiovisual speech integration in readers with dyslexia. Copyright © 2017 IBRO. Published by Elsevier Ltd. All rights reserved.
Bruderer, Alison G; Danielson, D Kyle; Kandhadai, Padmapriya; Werker, Janet F
The influence of speech production on speech perception is well established in adults. However, because adults have a long history of both perceiving and producing speech, the extent to which the perception-production linkage is due to experience is unknown. We addressed this issue by asking whether articulatory configurations can influence infants' speech perception performance. To eliminate influences from specific linguistic experience, we studied preverbal, 6-mo-old infants and tested the discrimination of a nonnative, and hence never-before-experienced, speech sound distinction. In three experimental studies, we used teething toys to control the position and movement of the tongue tip while the infants listened to the speech sounds. Using ultrasound imaging technology, we verified that the teething toys consistently and effectively constrained the movement and positioning of infants' tongues. With a looking-time procedure, we found that temporarily restraining infants' articulators impeded their discrimination of a nonnative consonant contrast but only when the relevant articulator was selectively restrained to prevent the movements associated with producing those sounds. Our results provide striking evidence that even before infants speak their first words and without specific listening experience, sensorimotor information from the articulators influences speech perception. These results transform theories of speech perception by suggesting that even at the initial stages of development, oral-motor movements influence speech sound discrimination. Moreover, an experimentally induced "impairment" in articulator movement can compromise speech perception performance, raising the question of whether long-term oral-motor impairments may impact perceptual development.
Massaro, Dominic W; Chen, Trevor H
Galantucci, Fowler, and Turvey (2006) have claimed that perceiving speech is perceiving gestures and that the motor system is recruited for perceiving speech. We make the counter argument that perceiving speech is not perceiving gestures, that the motor system is not recruitedfor perceiving speech, and that speech perception can be adequately described by a prototypical pattern recognition model, the fuzzy logical model of perception (FLMP). Empirical evidence taken as support for gesture and motor theory is reconsidered in more detail and in the framework of the FLMR Additional theoretical and logical arguments are made to challenge gesture and motor theory.
Full Text Available The paper deals with analytical review, covering the latest achievements in the field of audio-visual (AV fusion (integration of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of the AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give the classification of audio and visual features of speech. Special attention is paid to the systematization of the existing techniques and the AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use the AV fusion based on carried out analysis of research area. We also indicate used methods, techniques, audio and video features. We propose classification of the AV integration, and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future in the field of AV fusion. In the further research we plan to implement a system of audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.
Venezia, Jonathan H; Vaden, Kenneth I; Rong, Feng; Maddox, Dale; Saberi, Kourosh; Hickok, Gregory
The human superior temporal sulcus (STS) is responsive to visual and auditory information, including sounds and facial cues during speech recognition. We investigated the functional organization of STS with respect to modality-specific and multimodal speech representations. Twenty younger adult participants were instructed to perform an oddball detection task and were presented with auditory, visual, and audiovisual speech stimuli, as well as auditory and visual nonspeech control stimuli in a block fMRI design. Consistent with a hypothesized anterior-posterior processing gradient in STS, auditory, visual and audiovisual stimuli produced the largest BOLD effects in anterior, posterior and middle STS (mSTS), respectively, based on whole-brain, linear mixed effects and principal component analyses. Notably, the mSTS exhibited preferential responses to multisensory stimulation, as well as speech compared to nonspeech. Within the mid-posterior and mSTS regions, response preferences changed gradually from visual, to multisensory, to auditory moving posterior to anterior. Post hoc analysis of visual regions in the posterior STS revealed that a single subregion bordering the mSTS was insensitive to differences in low-level motion kinematics yet distinguished between visual speech and nonspeech based on multi-voxel activation patterns. These results suggest that auditory and visual speech representations are elaborated gradually within anterior and posterior processing streams, respectively, and may be integrated within the mSTS, which is sensitive to more abstract speech information within and across presentation modalities. The spatial organization of STS is consistent with processing streams that are hypothesized to synthesize perceptual speech representations from sensory signals that provide convergent information from visual and auditory modalities.
Full Text Available OBJECTIVE: To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI users. METHODS: Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px, frame rates (30, 20, 10, 7, 5 frames per second (fps, speech velocities (three different speakers, webcameras (Logitech Pro9000, C600 and C500 and image/sound delays (0-500 ms. All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. RESULTS: Higher frame rate (>7 fps, higher camera resolution (>640 × 480 px and shorter picture/sound delay (<100 ms were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009 in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11 showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032. CONCLUSION: Webcameras have the potential to improve telecommunication of hearing-impaired individuals.
Maidment, David W.; Kang, Hi Jee; Stewart, Hannah J.; Amitay, Sygal
Purpose: The study explored whether visual information improves speech identification in typically developing children with normal hearing when the auditory signal is spectrally degraded. Method: Children (n = 69) and adults (n = 15) were presented with noise-vocoded sentences from the Children's Co-ordinate Response Measure (Rosen, 2011) in…
Jerger, Susan; Tye-Murray, Nancy; Damian, Markus F; Abdi, Hervé
This research studied whether the mode of input (auditory versus audiovisual) influenced semantic access by speech in children with sensorineural hearing impairment (HI). Participants, 31 children with HI and 62 children with normal hearing (NH), were tested with the authors' new multimodal picture word task. Children were instructed to name pictures displayed on a monitor and ignore auditory or audiovisual speech distractors. The semantic content of the distractors was varied to be related versus unrelated to the pictures (e.g., picture distractor of dog-bear versus dog-cheese, respectively). In children with NH, picture-naming times were slower in the presence of semantically related distractors. This slowing, called semantic interference, is attributed to the meaning-related picture-distractor entries competing for selection and control of the response (the lexical selection by competition hypothesis). Recently, a modification of the lexical selection by competition hypothesis, called the competition threshold (CT) hypothesis, proposed that (1) the competition between the picture-distractor entries is determined by a threshold, and (2) distractors with experimentally reduced fidelity cannot reach the CT. Thus, semantically related distractors with reduced fidelity do not produce the normal interference effect, but instead no effect or semantic facilitation (faster picture naming times for semantically related versus unrelated distractors). Facilitation occurs because the activation level of the semantically related distractor with reduced fidelity (1) is not sufficient to exceed the CT and produce interference but (2) is sufficient to activate its concept, which then strengthens the activation of the picture and facilitates naming. This research investigated whether the proposals of the CT hypothesis generalize to the auditory domain, to the natural degradation of speech due to HI, and to participants who are children. Our multimodal picture word task allowed us
Salmi, Juha; Koistinen, Olli-Pekka; Glerean, Enrico; Jylänki, Pasi; Vehtari, Aki; Jääskeläinen, Iiro P; Mäkelä, Sasu; Nummenmaa, Lauri; Nummi-Kuisma, Katarina; Nummi, Ilari; Sams, Mikko
During a conversation or when listening to music, auditory and visual information are combined automatically into audiovisual objects. However, it is still poorly understood how specific type of visual information shapes neural processing of sounds in lifelike stimulus environments. Here we applied multi-voxel pattern analysis to investigate how naturally matching visual input modulates supratemporal cortex activity during processing of naturalistic acoustic speech, singing and instrumental music. Bayesian logistic regression classifiers with sparsity-promoting priors were trained to predict whether the stimulus was audiovisual or auditory, and whether it contained piano playing, speech, or singing. The predictive performances of the classifiers were tested by leaving one participant at a time for testing and training the model using the remaining 15 participants. The signature patterns associated with unimodal auditory stimuli encompassed distributed locations mostly in the middle and superior temporal gyrus (STG/MTG). A pattern regression analysis, based on a continuous acoustic model, revealed that activity in some of these MTG and STG areas were associated with acoustic features present in speech and music stimuli. Concurrent visual stimulus modulated activity in bilateral MTG (speech), lateral aspect of right anterior STG (singing), and bilateral parietal opercular cortex (piano). Our results suggest that specific supratemporal brain areas are involved in processing complex natural speech, singing, and piano playing, and other brain areas located in anterior (facial speech) and posterior (music-related hand actions) supratemporal cortex are influenced by related visual information. Those anterior and posterior supratemporal areas have been linked to stimulus identification and sensory-motor integration, respectively. Copyright © 2017 Elsevier Inc. All rights reserved.
Lynne E Bernstein
Full Text Available This paper examines the questions, what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody can be perceived visually, although individuals differ in their abilities to do so; and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically, face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread diverse patterns activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1 The visual perception of speech relies on visual pathway representations of speech qua speech. (2 A proposed site of these representations, the temporal visual speech area (TVSA has been demonstrated in posterior temporal cortex, ventral and posterior to multisensory posterior superior temporal sulcus (pSTS. (3 Given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA.
Carbonell, Kathy M.
One of the lasting concerns in audiology is the unexplained individual differences in speech perception performance even for individuals with similar audiograms. One proposal is that there are cognitive/perceptual individual differences underlying this vulnerability and that these differences are present in normal hearing (NH) individuals but do not reveal themselves in studies that use clear speech produced in quiet (because of a ceiling effect). However, previous studies have failed to uncover cognitive/perceptual variables that explain much of the variance in NH performance on more challenging degraded speech tasks. This lack of strong correlations may be due to either examining the wrong measures (e.g., working memory capacity) or to there being no reliable differences in degraded speech performance in NH listeners (i.e., variability in performance is due to measurement noise). The proposed project has 3 aims; the first, is to establish whether there are reliable individual differences in degraded speech performance for NH listeners that are sustained both across degradation types (speech in noise, compressed speech, noise-vocoded speech) and across multiple testing sessions. The second aim is to establish whether there are reliable differences in NH listeners' ability to adapt their phonetic categories based on short-term statistics both across tasks and across sessions; and finally, to determine whether performance on degraded speech perception tasks are correlated with performance on phonetic adaptability tasks, thus establishing a possible explanatory variable for individual differences in speech perception for NH and hearing impaired listeners.
Using Moore's (1993) theory of transactional distance as a framework, this action research study explored students' perceptions of audiovisual feedback provided via screencasting as a supplement to text-only feedback. A crossover design was employed to ensure that all students experienced both text-only and text-plus-audiovisual feedback and to…
Dubois, Cyril; Otzenberger, Helene; Gounot, Daniel; Sock, Rudolph; Metz-Lutz, Marie-Noelle
In a noisy environment, visual perception of articulatory movements improves natural speech intelligibility. Parallel to phonemic processing based on auditory signal, visemic processing constitutes a counterpart based on "visemes", the distinctive visual units of speech. Aiming at investigating the neural substrates of visemic processing in a…
Full Text Available One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming realtively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process which implies rigidity of processingd with few demands on cognitive processing. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing by masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions that include descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine cognitive resources recruited during perception including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, this may provide new insights into ways in which hearing disorders and loss may be treated either through augementation or
A. A. Karpov
Full Text Available We present a conceptual model, architecture and software of a multimodal system for audio-visual speech and sign language synthesis by the input text. The main components of the developed multimodal synthesis system (signing avatar are: automatic text processor for input text analysis; simulation 3D model of human's head; computer text-to-speech synthesizer; a system for audio-visual speech synthesis; simulation 3D model of human’s hands and upper body; multimodal user interface integrating all the components for generation of audio, visual and signed speech. The proposed system performs automatic translation of input textual information into speech (audio information and gestures (video information, information fusion and its output in the form of multimedia information. A user can input any grammatically correct text in Russian or Czech languages to the system; it is analyzed by the text processor to detect sentences, words and characters. Then this textual information is converted into symbols of the sign language notation. We apply international «Hamburg Notation System» - HamNoSys, which describes the main differential features of each manual sign: hand shape, hand orientation, place and type of movement. On their basis the 3D signing avatar displays the elements of the sign language. The virtual 3D model of human’s head and upper body has been created using VRML virtual reality modeling language, and it is controlled by the software based on OpenGL graphical library. The developed multimodal synthesis system is a universal one since it is oriented for both regular users and disabled people (in particular, for the hard-of-hearing and visually impaired, and it serves for multimedia output (by audio and visual modalities of input textual information.
, in the light of the rich and complex Danish sound system. The first two studies report on native adults’ perception of Danish speech sounds in quiet and noise. The third study examined the development of language-specific perception in native Danish infants at 6, 9 and 12 months of age. The book points......Little is known about the perception of speech sounds by native Danish listeners. However, the Danish sound system differs in several interesting ways from the sound systems of other languages. For instance, Danish is characterized, among other features, by a rich vowel inventory and by different...... reductions of speech sounds evident in the pronunciation of the language. This book (originally a PhD thesis) consists of three studies based on the results of two experiments. The experiments were designed to provide knowledge of the perception of Danish speech sounds by Danish adults and infants...
Full Text Available The sparse information captured by the sensory systems is used by the brain to apprehend the environment, for example, to spatially locate the source of audiovisual stimuli. This is an ill-posed inverse problem whose inherent uncertainty can be solved by jointly processing the information, as well as introducing constraints during this process, on the way this multisensory information is handled. This process and its result--the percept--depend on the contextual conditions perception takes place in. To date, perception has been investigated and modeled on the basis of either one of two of its dimensions: the percept or the temporal dynamics of the process. Here, we extend our previously proposed audiovisual perception model to predict both these dimensions to capture the phenomenon as a whole. Starting from a behavioral analysis, we use a data-driven approach to elicit a bayesian network which infers the different percepts and dynamics of the process. Context-specific independence analyses enable us to use the model's structure to directly explore how different contexts affect the way subjects handle the same available information. Hence, we establish that, while the percepts yielded by a unisensory stimulus or by the non-fusion of multisensory stimuli may be similar, they result from different processes, as shown by their differing temporal dynamics. Moreover, our model predicts the impact of bottom-up (stimulus driven factors as well as of top-down factors (induced by instruction manipulation on both the perception process and the percept itself.
Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
Aparicio, Mario; Peigneux, Philippe; Charlier, Brigitte; Balériaux, Danielle; Kavec, Martin; Leybaert, Jacqueline
We present here the first neuroimaging data for perception of Cued Speech (CS) by deaf adults who are native users of CS. CS is a visual mode of communicating a spoken language through a set of manual cues which accompany lipreading and disambiguate it. With CS, sublexical units of the oral language are conveyed clearly and completely through the visual modality without requiring hearing. The comparison of neural processing of CS in deaf individuals with processing of audiovisual (AV) speech in normally hearing individuals represents a unique opportunity to explore the similarities and differences in neural processing of an oral language delivered in a visuo-manual vs. an AV modality. The study included deaf adult participants who were early CS users and native hearing users of French who process speech audiovisually. Words were presented in an event-related fMRI design. Three conditions were presented to each group of participants. The deaf participants saw CS words (manual + lipread), words presented as manual cues alone, and words presented to be lipread without manual cues. The hearing group saw AV spoken words, audio-alone and lipread-alone. Three findings are highlighted. First, the middle and superior temporal gyrus (excluding Heschl’s gyrus) and left inferior frontal gyrus pars triangularis constituted a common, amodal neural basis for AV and CS perception. Second, integration was inferred in posterior parts of superior temporal sulcus for audio and lipread information in AV speech, but in the occipito-temporal junction, including MT/V5, for the manual cues and lipreading in CS. Third, the perception of manual cues showed a much greater overlap with the regions activated by CS (manual + lipreading) than lipreading alone did. This supports the notion that manual cues play a larger role than lipreading for CS processing. The present study contributes to a better understanding of the role of manual cues as support of visual speech perception in the framework
Rhone, Ariane E; Nourski, Kirill V; Oya, Hiroyuki; Kawasaki, Hiroto; Howard, Matthew A; McMurray, Bob
In everyday conversation, viewing a talker's face can provide information about the timing and content of an upcoming speech signal, resulting in improved intelligibility. Using electrocorticography, we tested whether human auditory cortex in Heschl's gyrus (HG) and on superior temporal gyrus (STG) and motor cortex on precentral gyrus (PreC) were responsive to visual/gestural information prior to the onset of sound and whether early stages of auditory processing were sensitive to the visual content (speech syllable versus non-speech motion). Event-related band power (ERBP) in the high gamma band was content-specific prior to acoustic onset on STG and PreC, and ERBP in the beta band differed in all three areas. Following sound onset, we found with no evidence for content-specificity in HG, evidence for visual specificity in PreC, and specificity for both modalities in STG. These results support models of audio-visual processing in which sensory information is integrated in non-primary cortical areas.
Eramudugolla, Ranmalee; Henderson, Rachel; Mattingley, Jason B.
Integration of simultaneous auditory and visual information about an event can enhance our ability to detect that event. This is particularly evident in the perception of speech, where the articulatory gestures of the speaker's lips and face can significantly improve the listener's detection and identification of the message, especially when that…
Brooks, Cassandra J; Chan, Yu Man; Anderson, Andrew J; McKendrick, Allison M
Within each sensory modality, age-related deficits in temporal perception contribute to the difficulties older adults experience when performing everyday tasks. Since perceptual experience is inherently multisensory, older adults also face the added challenge of appropriately integrating or segregating the auditory and visual cues present in our dynamic environment into coherent representations of distinct objects. As such, many studies have investigated how older adults perform when integrating temporal information across audition and vision. This review covers both direct judgments about temporal information (the sound-induced flash illusion, temporal order, perceived synchrony, and temporal rate discrimination) and judgments regarding stimuli containing temporal information (the audiovisual bounce effect and speech perception). Although an age-related increase in integration has been demonstrated on a variety of tasks, research specifically investigating the ability of older adults to integrate temporal auditory and visual cues has produced disparate results. In this short review, we explore what factors could underlie these divergent findings. We conclude that both task-specific differences and age-related sensory loss play a role in the reported disparity in age-related effects on the integration of auditory and visual temporal information.
Brooks, Cassandra J.; Chan, Yu Man; Anderson, Andrew J.; McKendrick, Allison M.
Within each sensory modality, age-related deficits in temporal perception contribute to the difficulties older adults experience when performing everyday tasks. Since perceptual experience is inherently multisensory, older adults also face the added challenge of appropriately integrating or segregating the auditory and visual cues present in our dynamic environment into coherent representations of distinct objects. As such, many studies have investigated how older adults perform when integrating temporal information across audition and vision. This review covers both direct judgments about temporal information (the sound-induced flash illusion, temporal order, perceived synchrony, and temporal rate discrimination) and judgments regarding stimuli containing temporal information (the audiovisual bounce effect and speech perception). Although an age-related increase in integration has been demonstrated on a variety of tasks, research specifically investigating the ability of older adults to integrate temporal auditory and visual cues has produced disparate results. In this short review, we explore what factors could underlie these divergent findings. We conclude that both task-specific differences and age-related sensory loss play a role in the reported disparity in age-related effects on the integration of auditory and visual temporal information. PMID:29867415
Mantokoudis, Georgios; Dähler, Claudia; Dubach, Patrick; Kompis, Martin; Caversaccio, Marco D.; Senn, Pascal
Objective To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI) users. Methods Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM) sentence test. We presented video simulations using different video resolutions (1280×720, 640×480, 320×240, 160×120 px), frame rates (30, 20, 10, 7, 5 frames per second (fps)), speech velocities (three different speakers), webcameras (Logitech Pro9000, C600 and C500) and image/sound delays (0–500 ms). All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. Results Higher frame rate (>7 fps), higher camera resolution (>640×480 px) and shorter picture/sound delay (<100 ms) were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009) in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11) showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032). Conclusion Webcameras have the potential to improve telecommunication of hearing-impaired individuals. PMID:23359119
Alan James Power
Full Text Available Auditory cortical oscillations have been proposed to play an important role in speech perception. It is suggested that the brain may take temporal ‘samples’ of information from the speech stream at different rates, phase-resetting ongoing oscillations so that they are aligned with similar frequency bands in the input (‘phase locking’. Information from these frequency bands is then bound together for speech perception. To date, there are no explorations of neural phase-locking and entrainment to speech input in children. However, it is clear from studies of language acquisition that infants use both visual speech information and auditory speech information in learning. In order to study neural entrainment to speech in typically-developing children, we use a rhythmic entrainment paradigm (underlying 2 Hz or delta rate based on repetition of the syllable ba, presented in either the auditory modality alone, the visual modality alone, or as auditory-visual speech (via a talking head. To ensure attention to the task, children aged 13 years were asked to press a button as fast as possible when the ba stimulus violated the rhythm for each stream type. Rhythmic violation depended on delaying the occurrence of a ba in the isochronous stream. Neural entrainment was demonstrated for all stream types, and individual differences in standardized measures of language processing were related to auditory entrainment at the theta rate. Further, there was significant modulation of the preferred phase of auditory entrainment in the theta band when visual speech cues were present, indicating cross-modal phase resetting. The rhythmic entrainment paradigm developed here offers a method for exploring individual differences in oscillatory phase locking during development. In particular, a method for assessing neural entrainment and cross-modal phase resetting would be useful for exploring developmental learning difficulties thought to involve temporal sampling
Lotto, Andrew J.; Hickok, Gregory S.; Holt, Lori L.
The discovery of mirror neurons, a class of neurons that respond when a monkey performs an action and also when the monkey observes others producing the same action, has promoted a renaissance for the Motor Theory (MT) of speech perception. This is because mirror neurons seem to accomplish the same kind of one to one mapping between perception and action that MT theorizes to be the basis of human speech communication. However, this seeming correspondence is superficial, and there are theoretical and empirical reasons to temper enthusiasm about the explanatory role mirror neurons might have for speech perception. In fact, rather than providing support for MT, mirror neurons are actually inconsistent with the central tenets of MT. PMID:19223222
Zuk, Jennifer; Iuzzini-Seigel, Jenya; Cabbage, Kathryn; Green, Jordan R.; Hogan, Tiffany P.
Purpose: Childhood apraxia of speech (CAS) is hypothesized to arise from deficits in speech motor planning and programming, but the influence of abnormal speech perception in CAS on these processes is debated. This study examined speech perception abilities among children with CAS with and without language impairment compared to those with…
Ross, Lars A; Molholm, Sophie; Blanco, Daniella; Gomez-Ramirez, Manuel; Saint-Amour, Dave; Foxe, John J
Observing a speaker's articulations substantially improves the intelligibility of spoken speech, especially under noisy listening conditions. This multisensory integration of speech inputs is crucial to effective communication. Appropriate development of this ability has major implications for children in classroom and social settings, and deficits in it have been linked to a number of neurodevelopmental disorders, especially autism. It is clear from structural imaging studies that there is a prolonged maturational course within regions of the perisylvian cortex that persists into late childhood, and these regions have been firmly established as being crucial to speech and language functions. Given this protracted maturational timeframe, we reasoned that multisensory speech processing might well show a similarly protracted developmental course. Previous work in adults has shown that audiovisual enhancement in word recognition is most apparent within a restricted range of signal-to-noise ratios (SNRs). Here, we investigated when these properties emerge during childhood by testing multisensory speech recognition abilities in typically developing children aged between 5 and 14 years, and comparing them with those of adults. By parametrically varying SNRs, we found that children benefited significantly less from observing visual articulations, displaying considerably less audiovisual enhancement. The findings suggest that improvement in the ability to recognize speech-in-noise and in audiovisual integration during speech perception continues quite late into the childhood years. The implication is that a considerable amount of multisensory learning remains to be achieved during the later schooling years, and that explicit efforts to accommodate this learning may well be warranted. European Journal of Neuroscience © 2011 Federation of European Neuroscience Societies and Blackwell Publishing Ltd. No claim to original US government works.
Gordon, Peter C.
Four experiments addressing the role of attention in phonetic perception are reported. The first experiment shows that the relative importance of two cues to the voicing distinction changes when subjects must perform an arithmetic distractor task at the same time as identifying a speech stimulus. The voice onset time cue loses phonetic significance when subjects are distracted, while the F0 onset frequency cue does not. The second experiment shows a similar pattern for two cues to the distinction between the vowels /i/ (as in 'beat') and /I/ (as in 'bit'). Together these experiments indicate that careful attention to speech perception is necessary for strong acoustic cues to achieve their full phonetic impact, while weaker acoustic cues achieve their full phonetic impact without close attention. Experiment 3 shows that this pattern is obtained when the distractor task places little demand on verbal short term memory. Experiment 4 provides a large data set for testing formal models of the role of attention in speech perception. Attention is shown to influence the signal to noise ratio in phonetic encoding. This principle is instantiated in a network model in which the role of attention is to reduce noise in the phonetic encoding of acoustic cues. Implications of this work for understanding speech perception and general theories of the role of attention in perception are discussed.
Stewart, Darryl; Seymour, Rowan; Pass, Adrian; Ming, Ji
This paper presents the maximum weighted stream posterior (MWSP) model as a robust and efficient stream integration method for audio-visual speech recognition in environments, where the audio or video streams may be subjected to unknown and time-varying corruption. A significant advantage of MWSP is that it does not require any specific measurements of the signal in either stream to calculate appropriate stream weights during recognition, and as such it is modality-independent. This also means that MWSP complements and can be used alongside many of the other approaches that have been proposed in the literature for this problem. For evaluation we used the large XM2VTS database for speaker-independent audio-visual speech recognition. The extensive tests include both clean and corrupted utterances with corruption added in either/both the video and audio streams using a variety of types (e.g., MPEG-4 video compression) and levels of noise. The experiments show that this approach gives excellent performance in comparison to another well-known dynamic stream weighting approach and also compared to any fixed-weighted integration approach in both clean conditions or when noise is added to either stream. Furthermore, our experiments show that the MWSP approach dynamically selects suitable integration weights on a frame-by-frame basis according to the level of noise in the streams and also according to the naturally fluctuating relative reliability of the modalities even in clean conditions. The MWSP approach is shown to maintain robust recognition performance in all tested conditions, while requiring no prior knowledge about the type or level of noise.
Full Text Available Initially, infants are capable of discriminating phonetic contrasts across the world’s languages. Starting between seven and ten months of age, they gradually lose this ability through a process of perceptual narrowing. Although traditionally investigated with isolated speech sounds, such narrowing occurs in a variety of perceptual domains (e.g., faces, visual speech. Thus far, tracking the developmental trajectory of this tuning process has been focused primarily on auditory speech alone, and generally using isolated sounds. But infants learn from speech produced by people talking to them, meaning they learn from a complex audiovisual signal. Here, we use near-infrared spectroscopy to measure blood concentration changes in the bilateral temporal cortices of infants in three different age groups: 3-to-6 months, 7-to-10 months, and 11-to-14-months. Critically, all three groups of infants were tested with continuous audiovisual speech in both their native and another, unfamiliar language. We found that at each age range, infants showed different patterns of cortical activity in response to the native and non-native stimuli. Infants in the youngest group showed bilateral cortical activity that was greater overall in response to non-native relative to native speech; the oldest group showed left lateralized activity in response to native relative to non-native speech. These results highlight perceptual tuning as a dynamic process that happens across modalities and at different levels of stimulus complexity.
Elena V Kushnerenko
Full Text Available The use of visual cues during the processing of audiovisual speech is known to be less efficient in children and adults with language difficulties and difficulties are known to be more prevalent in children from low-income populations. In the present study, we followed an economically diverse group of thirty-seven infants longitudinally from 6-9 months to 14-16 months of age. We used eye-tracking to examine whether individual differences in visual attention during audiovisual processing of speech in 6 to 9 month old infants, particularly when processing congruent and incongruent auditory and visual speech cues, might be indicative of their later language development. Twenty-two of these 6-9 month old infants also participated in an event-related potential (ERP audiovisual task within the same experimental session. Language development was then followed-up at the age of 14-16 months, using two measures of language development, the Preschool Language Scale (PLS and the Oxford Communicative Development Inventory (CDI. The results show that those infants who were less efficient in auditory speech processing at the age of 6-9 months had lower receptive language scores at 14-16 months. A correlational analysis revealed that the pattern of face scanning and ERP responses to audio-visually incongruent stimuli at 6-9 months were both significantly associated with language development at 14-16 months. These findings add to the understanding of individual differences in neural signatures of audiovisual processing and associated looking behaviour in infants.
Marsdin, Emma; Noble, Jeremy G; Reynard, John M; Turney, Benjamin W
Lithotripsy is an established method to fragment kidney stones that can be performed without general anesthesia in the outpatient setting. Discomfort and/or noise, however, may deter some patients. It has been demonstrated that audiovisual distraction (AV) can reduce sedoanalgesic requirements and improve patient satisfaction in nonurologic settings, but to our knowledge, this has not been investigated with lithotripsy. This randomized controlled trial was designed to test the hypothesis that AV distraction can reduce perceived pain during lithotripsy. All patients in the study received identical analgesia before a complete session of lithotripsy on a fixed-site Storz Modulith SLX F2 lithotripter. Patients were randomized to two groups: One group (n=61) received AV distraction via a wall-mounted 32″ (82 cm) television with wireless headphones; the other group (n=57) received no AV distraction. The mean intensity of treatment was comparable in both groups. Patients used a visual analogue scale (0-10) to record independent pain and distress scores and a nonverbal pain score was documented by the radiographer during the procedure (0-4). In the group that received AV distraction, all measures of pain perception were statistically lower. The patient-reported pain score was reduced from a mean of 6.1 to 2.4 (P<0.0001), and the distress score was reduced from a mean of 4.4 to 1.0 (P=0.0001). The mean nonverbal score recorded by the radiographer was reduced from 1.5 to 0.5 (<0.0001). AV distraction significantly lowered patients' reported pain and distress scores. This correlated with the nonverbal scores reported by the radiographer. We conclude that AV distraction is a simple method of improving acceptance of lithotripsy and optimizing treatment.
Stasenko, Alena; Bonn, Cory; Teghipco, Alex; Garcea, Frank E; Sweet, Catherine; Dombovy, Mary; McDonough, Joyce; Mahon, Bradford Z
The debate about the causal role of the motor system in speech perception has been reignited by demonstrations that motor processes are engaged during the processing of speech sounds. Here, we evaluate which aspects of auditory speech processing are affected, and which are not, in a stroke patient with dysfunction of the speech motor system. We found that the patient showed a normal phonemic categorical boundary when discriminating two non-words that differ by a minimal pair (e.g., ADA-AGA). However, using the same stimuli, the patient was unable to identify or label the non-word stimuli (using a button-press response). A control task showed that he could identify speech sounds by speaker gender, ruling out a general labelling impairment. These data suggest that while the motor system is not causally involved in perception of the speech signal, it may be used when other cues (e.g., meaning, context) are not available.
Vander Wyk, Brent C.; Ramsay, Gordon J.; Hudac, Caitlin M.; Jones, Warren; Lin, David; Klin, Ami; Lee, Su Mei; Pelphrey, Kevin A.
We investigated the neural basis of audio-visual processing in speech and non-speech stimuli. Physically identical auditory stimuli (speech and sinusoidal tones) and visual stimuli (animated circles and ellipses) were used in this fMRI experiment. Relative to unimodal stimuli, each of the multimodal conjunctions showed increased activation in largely non-overlapping areas. The conjunction of Ellipse and Speech, which most resembles naturalistic audiovisual speech, showed higher activation in the right inferior frontal gyrus, fusiform gyri, left posterior superior temporal sulcus, and lateral occipital cortex. The conjunction of Circle and Tone, an arbitrary audio-visual pairing with no speech association, activated middle temporal gyri and lateral occipital cortex. The conjunction of Circle and Speech showed activation in lateral occipital cortex, and the conjunction of Ellipse and Tone did not show increased activation relative to unimodal stimuli. Further analysis revealed that middle temporal regions, although identified as multimodal only in the Circle-Tone condition, were more strongly active to Ellipse-Speech or Circle-Speech, but regions that were identified as multimodal for Ellipse-Speech were always strongest for Ellipse-Speech. Our results suggest that combinations of auditory and visual stimuli may together be processed by different cortical networks, depending on the extent to which speech or non-speech percepts are evoked. PMID:20709442
Michalek, Anne M P; Watson, Silvana M; Ash, Ivan; Ringleb, Stacie; Raymer, Anastasia
This study examined the interplay among internal (e.g. attention, working memory abilities) and external (e.g. background noise, visual information) factors in individuals with and without ADHD. A 2 × 2 × 6 mixed design with correlational analyses was used to compare participant results on a standardized listening in noise sentence repetition task (QuickSin; Killion et al, 2004 ), presented in an auditory and an audiovisual condition as signal-to-noise ratio (SNR) varied from 25-0 dB and to determine individual differences in working memory capacity and short-term recall. Thirty-eight young adults without ADHD and twenty-five young adults with ADHD. Diagnosis, modality, and signal-to-noise ratio all affected the ability to process speech in noise. The interaction between the diagnosis of ADHD, the presence of visual cues, and the level of noise had an effect on a person's ability to process speech in noise. conclusion: Young adults with ADHD benefited less from visual information during noise than young adults without ADHD, an effect influenced by working memory abilities.
Stevenson, Ryan A; Baum, Sarah H; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Wallace, Mark T
Speech perception in noisy environments is boosted when a listener can see the speaker's mouth and integrate the auditory and visual speech information. Autistic children have a diminished capacity to integrate sensory information across modalities, which contributes to core symptoms of autism, such as impairments in social communication. We investigated the abilities of autistic and typically-developing (TD) children to integrate auditory and visual speech stimuli in various signal-to-noise ratios (SNR). Measurements of both whole-word and phoneme recognition were recorded. At the level of whole-word recognition, autistic children exhibited reduced performance in both the auditory and audiovisual modalities. Importantly, autistic children showed reduced behavioral benefit from multisensory integration with whole-word recognition, specifically at low SNRs. At the level of phoneme recognition, autistic children exhibited reduced performance relative to their TD peers in auditory, visual, and audiovisual modalities. However, and in contrast to their performance at the level of whole-word recognition, both autistic and TD children showed benefits from multisensory integration for phoneme recognition. In accordance with the principle of inverse effectiveness, both groups exhibited greater benefit at low SNRs relative to high SNRs. Thus, while autistic children showed typical multisensory benefits during phoneme recognition, these benefits did not translate to typical multisensory benefit of whole-word recognition in noisy environments. We hypothesize that sensory impairments in autistic children raise the SNR threshold needed to extract meaningful information from a given sensory input, resulting in subsequent failure to exhibit behavioral benefits from additional sensory information at the level of whole-word recognition. Autism Res 2017. © 2017 International Society for Autism Research, Wiley Periodicals, Inc. Autism Res 2017, 10: 1280-1290. © 2017 International
de Kok, D.A.; Jonkers, R.; Bastiaanse, Y.R.M.
Individuals with aphasia have more problems detecting small differences between speech sounds than larger ones. This paper reports how phonemic processing is impaired and how this is influenced by speechreading. A non-word discrimination task was carried out with 'audiovisual', 'auditory only' and
Kenney, Mary Kay; Barac-Cikoja, Dragana; Finnegan, Kimberly; Jeffries, Neal; Ludlow, Christy L.
Children with developmental speech disorders may have additional deficits in speech perception and/or short-term memory. To determine whether these are only transient developmental delays that can accompany the disorder in childhood or persist as part of the speech disorder, adults with a persistent familial speech disorder were tested on speech…
Full Text Available This present study investigated the link between speech-in-speech perception capacities and four executive function components: response suppression, inhibitory control, switching and working memory. We constructed a cross-modal semantic priming paradigm using a written target word and a spoken prime word, implemented in one of two concurrent auditory sentences (cocktail party situation. The prime and target were semantically related or unrelated. Participants had to perform a lexical decision task on visual target words and simultaneously listen to only one of two pronounced sentences. The attention of the participant was manipulated: The prime was in the pronounced sentence listened to by the participant or in the ignored one. In addition, we evaluate the executive function abilities of participants (switching cost, inhibitory-control cost and response-suppression cost and their working memory span. Correlation analyses were performed between the executive and priming measurements. Our results showed a significant interaction effect between attention and semantic priming. We observed a significant priming effect in the attended but not in the ignored condition. Only priming effects obtained in the ignored condition were significantly correlated with some of the executive measurements. However, no correlation between priming effects and working memory capacity was found. Overall, these results confirm, first, the role of attention for semantic priming effect and, second, the implication of executive functions in speech-in-noise understanding capacities.
Good, Arla; Gordon, Karen A; Papsin, Blake C; Nespoli, Gabe; Hopyan, Talar; Peretz, Isabelle; Russo, Frank A
Children who use cochlear implants (CIs) have characteristic pitch processing deficits leading to impairments in music perception and in understanding emotional intention in spoken language. Music training for normal-hearing children has previously been shown to benefit perception of emotional prosody. The purpose of the present study was to assess whether deaf children who use CIs obtain similar benefits from music training. We hypothesized that music training would lead to gains in auditory processing and that these gains would transfer to emotional speech prosody perception. Study participants were 18 child CI users (ages 6 to 15). Participants received either 6 months of music training (i.e., individualized piano lessons) or 6 months of visual art training (i.e., individualized painting lessons). Measures of music perception and emotional speech prosody perception were obtained pre-, mid-, and post-training. The Montreal Battery for Evaluation of Musical Abilities was used to measure five different aspects of music perception (scale, contour, interval, rhythm, and incidental memory). The emotional speech prosody task required participants to identify the emotional intention of a semantically neutral sentence under audio-only and audiovisual conditions. Music training led to improved performance on tasks requiring the discrimination of melodic contour and rhythm, as well as incidental memory for melodies. These improvements were predominantly found from mid- to post-training. Critically, music training also improved emotional speech prosody perception. Music training was most advantageous in audio-only conditions. Art training did not lead to the same improvements. Music training can lead to improvements in perception of music and emotional speech prosody, and thus may be an effective supplementary technique for supporting auditory rehabilitation following cochlear implantation.
Gordon, Karen A.; Papsin, Blake C.; Nespoli, Gabe; Hopyan, Talar; Peretz, Isabelle; Russo, Frank A.
Objectives: Children who use cochlear implants (CIs) have characteristic pitch processing deficits leading to impairments in music perception and in understanding emotional intention in spoken language. Music training for normal-hearing children has previously been shown to benefit perception of emotional prosody. The purpose of the present study was to assess whether deaf children who use CIs obtain similar benefits from music training. We hypothesized that music training would lead to gains in auditory processing and that these gains would transfer to emotional speech prosody perception. Design: Study participants were 18 child CI users (ages 6 to 15). Participants received either 6 months of music training (i.e., individualized piano lessons) or 6 months of visual art training (i.e., individualized painting lessons). Measures of music perception and emotional speech prosody perception were obtained pre-, mid-, and post-training. The Montreal Battery for Evaluation of Musical Abilities was used to measure five different aspects of music perception (scale, contour, interval, rhythm, and incidental memory). The emotional speech prosody task required participants to identify the emotional intention of a semantically neutral sentence under audio-only and audiovisual conditions. Results: Music training led to improved performance on tasks requiring the discrimination of melodic contour and rhythm, as well as incidental memory for melodies. These improvements were predominantly found from mid- to post-training. Critically, music training also improved emotional speech prosody perception. Music training was most advantageous in audio-only conditions. Art training did not lead to the same improvements. Conclusions: Music training can lead to improvements in perception of music and emotional speech prosody, and thus may be an effective supplementary technique for supporting auditory rehabilitation following cochlear implantation. PMID:28085739
Shahin, Antoine J.
Does musical training affect our perception of speech? For example, does learning to play a musical instrument modify the neural circuitry for auditory processing in a way that improves one’s ability to perceive speech more clearly in noisy environments? If so, can speech perception in individuals with hearing loss, who struggle in noisy situations, benefit from musical training? While music and speech exhibit some specialization in neural processing, there is evidence suggesting that skill...
This book presents a new approach to examining perceived quality of audiovisual sequences. It uses electroencephalography to understand how exactly user quality judgments are formed within a test participant, and what might be the physiologically-based implications when being exposed to lower quality media. The book redefines experimental paradigms of using EEG in the area of quality assessment so that they better suit the requirements of standard subjective quality testings. Therefore, experimental protocols and stimuli are adjusted accordingly. .
Campos-Sánchez, Antonio; López-Núñez, Juan-Antonio; Scionti, Giuseppe; Garzón, Ingrid; González-Andrades, Miguel; Alaminos, Miguel; Sola, Tomás
Videos can be used as didactic tools for self-learning under several circumstances, including those cases in which students are responsible for the development of this resource as an audiovisual notebook. We compared students' and teachers' perceptions regarding the main features that an audiovisual notebook should include. Four…
Jesse, Alexandra; Johnson, Elizabeth K
Analyses of caregiver-child communication suggest that an adult tends to highlight objects in a child's visual scene by moving them in a manner that is temporally aligned with the adult's speech productions. Here, we used the looking-while-listening paradigm to examine whether 25-month-olds use audiovisual temporal alignment to disambiguate and learn novel word-referent mappings in a difficult word-learning task. Videos of two equally interesting and animated novel objects were simultaneously presented to children, but the movement of only one of the objects was aligned with an accompanying object-labeling audio track. No social cues (e.g., pointing, eye gaze, touch) were available to the children because the speaker was edited out of the videos. Immediately afterward, toddlers were presented with still images of the two objects and asked to look at one or the other. Toddlers looked reliably longer to the labeled object, demonstrating their acquisition of the novel word-referent mapping. A control condition showed that children's performance was not solely due to the single unambiguous labeling that had occurred at experiment onset. We conclude that the temporal link between a speaker's utterances and the motion they imposed on the referent object helps toddlers to deduce a speaker's intended reference in a difficult word-learning scenario. In combination with our previous work, these findings suggest that intersensory redundancy is a source of information used by language users of all ages. That is, intersensory redundancy is not just a word-learning tool used by young infants. Copyright © 2015 Elsevier Inc. All rights reserved.
Jahn, Kelly N; Stevenson, Ryan A; Wallace, Mark T
Despite significant improvements in speech perception abilities following cochlear implantation, many prelingually deafened cochlear implant (CI) recipients continue to rely heavily on visual information to develop speech and language. Increased reliance on visual cues for understanding spoken language could lead to the development of unique audiovisual integration and visual-only processing abilities in these individuals. Brain imaging studies have demonstrated that good CI performers, as indexed by auditory-only speech perception abilities, have different patterns of visual cortex activation in response to visual and auditory stimuli as compared with poor CI performers. However, no studies have examined whether speech perception performance is related to any type of visual processing abilities following cochlear implantation. The purpose of the present study was to provide a preliminary examination of the relationship between clinical, auditory-only speech perception tests, and visual temporal acuity in prelingually deafened adult CI users. It was hypothesized that prelingually deafened CI users, who exhibit better (i.e., more acute) visual temporal processing abilities would demonstrate better auditory-only speech perception performance than those with poorer visual temporal acuity. Ten prelingually deafened adult CI users were recruited for this study. Participants completed a visual temporal order judgment task to quantify visual temporal acuity. To assess auditory-only speech perception abilities, participants completed the consonant-nucleus-consonant word recognition test and the AzBio sentence recognition test. Results were analyzed using two-tailed partial Pearson correlations, Spearman's rho correlations, and independent samples t tests. Visual temporal acuity was significantly correlated with auditory-only word and sentence recognition abilities. In addition, proficient CI users, as assessed via auditory-only speech perception performance, demonstrated
Morrill, Ryan J.; Paukner, Annika; Ferrari, Pier F.; Ghazanfar, Asif A.
Across all languages studied to date, audiovisual speech exhibits a consistent rhythmic structure. This rhythm is critical to speech perception. Some have suggested that the speech rhythm evolved "de novo" in humans. An alternative account--the one we explored here--is that the rhythm of speech evolved through the modification of rhythmic facial…
De Keyser, Kim; Santens, Patrick; Bockstael, Annelies; Botteldooren, Dick; Talsma, Durk; De Vos, Stefanie; Van Cauwenberghe, Mieke; Verheugen, Femke; Corthals, Paul; De Letter, Miet
Purpose: This study investigated the possible relationship between hypokinetic speech production and speech intensity perception in patients with Parkinson's disease (PD). Method: Participants included 14 patients with idiopathic PD and 14 matched healthy controls (HCs) with normal hearing and cognition. First, speech production was objectified…
Rumbach, Anna F; Rose, Tanya A; Cheah, Mynn
To explore Australian speech-language pathologists' use of non-speech oral motor exercises, and rationales for using/not using non-speech oral motor exercises in clinical practice. A total of 124 speech-language pathologists practising in Australia, working with paediatric and/or adult clients with speech sound difficulties, completed an online survey. The majority of speech-language pathologists reported that they did not use non-speech oral motor exercises when working with paediatric or adult clients with speech sound difficulties. However, more than half of the speech-language pathologists working with adult clients who have dysarthria reported using non-speech oral motor exercises with this population. The most frequently reported rationale for using non-speech oral motor exercises in speech sound difficulty management was to improve awareness/placement of articulators. The majority of speech-language pathologists agreed there is no clear clinical or research evidence base to support non-speech oral motor exercise use with clients who have speech sound difficulties. This study provides an overview of Australian speech-language pathologists' reported use and perceptions of non-speech oral motor exercises' applicability and efficacy in treating paediatric and adult clients who have speech sound difficulties. The research findings provide speech-language pathologists with insight into how and why non-speech oral motor exercises are currently used, and adds to the knowledge base regarding Australian speech-language pathology practice of non-speech oral motor exercises in the treatment of speech sound difficulties. Implications for Rehabilitation Non-speech oral motor exercises refer to oral motor activities which do not involve speech, but involve the manipulation or stimulation of oral structures including the lips, tongue, jaw, and soft palate. Non-speech oral motor exercises are intended to improve the function (e.g., movement, strength) of oral structures. The
Li, Yuanqing; Long, Jinyi; Huang, Biao; Yu, Tianyou; Wu, Wei; Liu, Yongjian; Liang, Changhong; Sun, Pei
Previous studies have shown that audiovisual integration improves identification performance and enhances neural activity in heteromodal brain areas, for example, the posterior superior temporal sulcus/middle temporal gyrus (pSTS/MTG). Furthermore, it has also been demonstrated that attention plays an important role in crossmodal integration. In this study, we considered crossmodal integration in audiovisual facial perception and explored its effect on the neural representation of features. The audiovisual stimuli in the experiment consisted of facial movie clips that could be classified into 2 gender categories (male vs. female) or 2 emotion categories (crying vs. laughing). The visual/auditory-only stimuli were created from these movie clips by removing the auditory/visual contents. The subjects needed to make a judgment about the gender/emotion category for each movie clip in the audiovisual, visual-only, or auditory-only stimulus condition as functional magnetic resonance imaging (fMRI) signals were recorded. The neural representation of the gender/emotion feature was assessed using the decoding accuracy and the brain pattern-related reproducibility indices, obtained by a multivariate pattern analysis method from the fMRI data. In comparison to the visual-only and auditory-only stimulus conditions, we found that audiovisual integration enhanced the neural representation of task-relevant features and that feature-selective attention might play a role of modulation in the audiovisual integration. © The Author 2013. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: firstname.lastname@example.org.
Giroux, Ibrahima; Rey, Arnaud
Saffran, Newport, and Aslin (1996a) found that human infants are sensitive to statistical regularities corresponding to lexical units when hearing an artificial spoken language. Two sorts of segmentation strategies have been proposed to account for this early word-segmentation ability: bracketing strategies, in which infants are assumed to insert boundaries into continuous speech, and clustering strategies, in which infants are assumed to group certain speech sequences together into units (Swingley, 2005). In the present study, we test the predictions of two computational models instantiating each of these strategies i.e., Serial Recurrent Networks: Elman, 1990; and Parser: Perruchet & Vinter, 1998 in an experiment where we compare the lexical and sublexical recognition performance of adults after hearing 2 or 10 min of an artificial spoken language. The results are consistent with Parser's predictions and the clustering approach, showing that performance on words is better than performance on part-words only after 10 min. This result suggests that word segmentation abilities are not merely due to stronger associations between sublexical units but to the emergence of stronger lexical representations during the development of speech perception processes. Copyright © 2009, Cognitive Science Society, Inc.
Mondelli, Maria Fernanda Capoani Garcia; dos Santos, Marina de Marchi; José, Maria Renata
ABSTRACT INTRODUCTION: Unilateral hearing loss is characterized by a decrease of hearing in one ear only. In the presence of ambient noise, individuals with unilateral hearing loss are faced with greater difficulties understanding speech than normal listeners. OBJECTIVE: To evaluate the speech perception of individuals with unilateral hearing loss in speech perception with and without competitive noise, before and after the hearing aid fitting process. METHODS: The study included 30 adu...
Martínez-Montes, Eduardo; Hernández-Pérez, Heivet; Chobert, Julie; Morgado-Rodríguez, Lisbet; Suárez-Murias, Carlos; Valdés-Sosa, Pedro A; Besson, Mireille
The aim of this experiment was to investigate the influence of musical expertise on the automatic perception of foreign syllables and harmonic sounds. Participants were Cuban students with high level of expertise in music or in visual arts and with the same level of general education and socio-economic background. We used a multi-feature Mismatch Negativity (MMN) design with sequences of either syllables in Mandarin Chinese or harmonic sounds, both comprising deviants in pitch contour, duration and Voice Onset Time (VOT) or equivalent that were either far from (Large deviants) or close to (Small deviants) the standard. For both Mandarin syllables and harmonic sounds, results were clear-cut in showing larger MMNs to pitch contour deviants in musicians than in visual artists. Results were less clear for duration and VOT deviants, possibly because of the specific characteristics of the stimuli. Results are interpreted as reflecting similar processing of pitch contour in speech and non-speech sounds. The implications of these results for understanding the influence of intense musical training from childhood to adulthood and of genetic predispositions for music on foreign language perception are discussed.
Full Text Available The aim of this experiment was to investigate the influence of musical expertise on the automatic perception of foreign syllables and harmonic sounds. Participants were Cuban students with high level of expertise in music or in visual arts and with the same level of general education and socio-economic background. We used a multi-feature Mismatch Negativity (MMN design with sequences of either syllables in Mandarin Chinese or harmonic sounds, both comprising deviants in pitch contour, duration and Voice Onset Time (VOT or equivalent that were either far from (Large deviants or close to (Small deviants the standard. For both Mandarin syllables and harmonic sounds, results were clear-cut in showing larger MMNs to pitch contour deviants in musicians than in visual artists. Results were less clear for duration and VOT deviants, possibly because of the specific characteristics of the stimuli. Results are interpreted as reflecting similar processing of pitch contour in speech and non-speech sounds. The implications of these results for understanding the influence of intense musical training from childhood to adulthood and of genetic predispositions for music on foreign language perception is discussed.
Full Text Available Motion perception is a pervasive nature of vision and is affected by both immediate pattern of sensory inputs and prior experiences acquired through associations. Recently, several studies reported that an association can be established quickly between directions of visual motion and static sounds of distinct frequencies. After the association is formed, sounds are able to change the perceived direction of visual motion. To determine whether such rapidly acquired audiovisual associations and their subsequent influences on visual motion perception are dependent on the involvement of higher-order attentive tracking mechanisms, we designed psychophysical experiments using regular and reverse-phi random dot motions isolating low-level pre-attentive motion processing. Our results show that an association between the directions of low-level visual motion and static sounds can be formed and this audiovisual association alters the subsequent perception of low-level visual motion. These findings support the view that audiovisual associations are not restricted to high-level attention based motion system and early-level visual motion processing has some potential role.
Greene, Beth G; Logan, John S; Pisoni, David B
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered.
GREENE, BETH G.; LOGAN, JOHN S.; PISONI, DAVID B.
We present the results of studies designed to measure the segmental intelligibility of eight text-to-speech systems and a natural speech control, using the Modified Rhyme Test (MRT). Results indicated that the voices tested could be grouped into four categories: natural speech, high-quality synthetic speech, moderate-quality synthetic speech, and low-quality synthetic speech. The overall performance of the best synthesis system, DECtalk-Paul, was equivalent to natural speech only in terms of performance on initial consonants. The findings are discussed in terms of recent work investigating the perception of synthetic speech under more severe conditions. Suggestions for future research on improving the quality of synthetic speech are also considered. PMID:23225916
Full Text Available One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those used in video-conference rooms, and thus, they may incur difficulties when constrained to the sensors with which a robot can be equipped. Besides, within the scope of interactive autonomous robots, there is a lack in terms of evaluating the benefits of audio-visual attention mechanisms, compared to only audio or visual approaches, in real scenarios. Most of the tests conducted have been within controlled environments, at short distances and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information with a Bayes inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. Moreover, the performance of this system is evaluated and compared via considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interactive framework.
Foundations of Voice and Speech Quality Perception starts out with the fundamental question of: "How do listeners perceive voice and speech quality and how can these processes be modeled?" Any quantitative answers require measurements. This is natural for physical quantities but harder to imagine for perceptual measurands. This book approaches the problem by actually identifying major perceptual dimensions of voice and speech quality perception, defining units wherever possible and offering paradigms to position these dimensions into a structural skeleton of perceptual speech and voice quality. The emphasis is placed on voice and speech quality assessment of systems in artificial scenarios. Many scientific fields are involved. This book bridges the gap between two quite diverse fields, engineering and humanities, and establishes the new research area of Voice and Speech Quality Perception.
Shahin, Antoine J
Does musical training affect our perception of speech? For example, does learning to play a musical instrument modify the neural circuitry for auditory processing in a way that improves one's ability to perceive speech more clearly in noisy environments? If so, can speech perception in individuals with hearing loss (HL), who struggle in noisy situations, benefit from musical training? While music and speech exhibit some specialization in neural processing, there is evidence suggesting that skills acquired through musical training for specific acoustical processes may transfer to, and thereby improve, speech perception. The neurophysiological mechanisms underlying the influence of musical training on speech processing and the extent of this influence remains a rich area to be explored. A prerequisite for such transfer is the facilitation of greater neurophysiological overlap between speech and music processing following musical training. This review first establishes a neurophysiological link between musical training and speech perception, and subsequently provides further hypotheses on the neurophysiological implications of musical training on speech perception in adverse acoustical environments and in individuals with HL.
Ortiz, Rosario; Jiménez, Juan E; Muñetón, Mercedes; Rojas, Estefanía; Estévez, Adelina; Guzmán, Remedios; Rodríguez, Cristina; Naranjo, Francisco
Several studies have indicated that dyslexics show a deficit in speech perception (SP). The main purpose of this research is to determine the development of SP in dyslexics and normal readers paired by grades from 2nd to 6th grade of primary school and to know whether the phonetic contrasts that are relevant for SP change during development, taking into account the individual differences. The achievement of both groups was compared in the phonetic tasks: voicing contrast, place of articulation contrast and manner of articulation contrast. The results showed that the dyslexic performed poorer than the normal readers in SP. In place of articulation contrast, the developmental pattern is similar in both groups but not in voicing and manner of articulation. Manner of articulation has more influence on SP, and its development is higher than the other contrast tasks in both groups.
Full Text Available Congenital amusics, or tone-deaf individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying band-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions from the Macquarie Battery for Evaluation of Prosody (MBEP were reliably identifiable by US English speakers. The most reliably identified emotional speech samples were used in in Experiment 1, in which subjects performed a psychophysical pitch discrimination task, and an emotion identification task under band-pass and unfiltered speech conditions. Results showed a significant correlation between pitch discrimination threshold and emotion identification accuracy for band-pass filtered speech, with amusics (defined here as those with a pitch discrimination threshold > 16 Hz performing worse than controls. This relationship with pitch discrimination was not seen in unfiltered speech conditions. Given the dissociation between band-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in this manipulation.
Hearnshaw, Stephanie; Baker, Elise; Munro, Natalie
To investigate whether Australian-English speaking children with and without speech sound disorder (SSD) differ in their overall speech perception accuracy. Additionally, to investigate differences in the perception of specific phonemes and the association between speech perception and speech production skills. Twenty-five Australian-English speaking children aged 48-60 months participated in this study. The SSD group included 12 children and the typically developing (TD) group included 13 children. Children completed routine speech and language assessments in addition to an experimental Australian-English lexical and phonetic judgement task based on Rvachew's Speech Assessment and Interactive Learning System (SAILS) program (Rvachew, 2009). This task included eight words across four word-initial phonemes-/k, ɹ, ʃ, s/. Children with SSD showed significantly poorer perceptual accuracy on the lexical and phonetic judgement task compared with TD peers. The phonemes /ɹ/ and /s/ were most frequently perceived in error across both groups. Additionally, the phoneme /ɹ/ was most commonly produced in error. There was also a positive correlation between overall speech perception and speech production scores. Children with SSD perceived speech less accurately than their typically developing peers. The findings suggest that an Australian-English variation of a lexical and phonetic judgement task similar to the SAILS program is promising and worthy of a larger scale study. Copyright © 2017 Elsevier Inc. All rights reserved.
Mitterer, Holger; Mattys, Sven L
Two experiments investigated the conditions under which cognitive load exerts an effect on the acuity of speech perception. These experiments extend earlier research by using a different speech perception task (four-interval oddity task) and by implementing cognitive load through a task often thought to be modular, namely, face processing. In the cognitive-load conditions, participants were required to remember two faces presented before the speech stimuli. In Experiment 1, performance in the speech-perception task under cognitive load was not impaired in comparison to a no-load baseline condition. In Experiment 2, we modified the load condition minimally such that it required encoding of the two faces simultaneously with the speech stimuli. As a reference condition, we also used a visual search task that in earlier experiments had led to poorer speech perception. Both concurrent tasks led to decrements in the speech task. The results suggest that speech perception is affected even by loads thought to be processed modularly, and that, critically, encoding in working memory might be the locus of interference.
Lolli, Sydney L; Lewenstein, Ari D; Basurto, Julian; Winnik, Sean; Loui, Psyche
Congenital amusics, or "tone-deaf" individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying low-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions from the Macquarie Battery for Evaluation of Prosody were reliably identifiable by US English speakers. The most reliably identified emotional speech samples were used in Experiment 1, in which subjects performed a psychophysical pitch discrimination task, and an emotion identification task under low-pass and unfiltered speech conditions. Results showed a significant correlation between pitch-discrimination threshold and emotion identification accuracy for low-pass filtered speech, with amusics (defined here as those with a pitch discrimination threshold >16 Hz) performing worse than controls. This relationship with pitch discrimination was not seen in unfiltered speech conditions. Given the dissociation between low-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in this manipulation. To assess this potential compensation, Experiment 2 was conducted using high-pass filtered speech samples intended to isolate non-pitch cues. No significant correlation was found between pitch discrimination and emotion identification accuracy for high-pass filtered speech. Results from these experiments suggest an influence of low frequency information in identifying emotional content of speech.
José Antonio Palao Errando
Full Text Available En la llamada «sociedad de la información» los estudios sobre cine se han visto diluidos en el abordaje pragmático y tecnológico del discurso audiovisual, así como la propia fruición del cine se ha visto atrapada en la red del DVD y del hipertexto. El propio cine reacciona ante ello a través de estructuras narrativas complejas que lo alejan del discurso audiovisual estándar. La función de los estudios sobre cine y de su enseñanza universitaria debe ser la reintroducción del sujeto rechazado del saber informativo por medio de la interpretación del texto fílmico. In the so called «information society», film studies have been diluted in the pragmatic and technological approaching of the audiovisual speech, as well as the own fruition of the cinema has been caught in the net of DVD and hypertext. The cinema itself reacts in the face of it through complex narrative structures that take it away from the standard audio-visual speech. The function of film studies at the university education should be the reintroduction of the rejected subject of the informative knowledge by means of the interpretation of film text.
Lametti, Daniel R.; Rochet-Capellan, Amélie; Neufeld, Emily; Shiller, Douglas M.
Recent studies of human speech motor learning suggest that learning is accompanied by changes in auditory perception. But what drives the perceptual change? Is it a consequence of changes in the motor system? Or is it a result of sensory inflow during learning? Here, subjects participated in a speech motor-learning task involving adaptation to altered auditory feedback and they were subsequently tested for perceptual change. In two separate experiments, involving two different auditory perceptual continua, we show that changes in the speech motor system that accompany learning drive changes in auditory speech perception. Specifically, we obtained changes in speech perception when adaptation to altered auditory feedback led to speech production that fell into the phonetic range of the speech perceptual tests. However, a similar change in perception was not observed when the auditory feedback that subjects' received during learning fell into the phonetic range of the perceptual tests. This indicates that the central motor outflow associated with vocal sensorimotor adaptation drives changes to the perceptual classification of speech sounds. PMID:25080594
Full Text Available A growing body of evidence shows that brain oscillations track speech. This mechanism is thought to maximise processing efficiency by allocating resources to important speech information, effectively parsing speech into units of appropriate granularity for further decoding. However, some aspects of this mechanism remain unclear. First, while periodicity is an intrinsic property of this physiological mechanism, speech is only quasi-periodic, so it is not clear whether periodicity would present an advantage in processing. Second, it is still a matter of debate which aspect of speech triggers or maintains cortical entrainment, from bottom-up cues such as fluctuations of the amplitude envelope of speech to higher level linguistic cues such as syntactic structure. We present data from a behavioural experiment assessing the effect of isochronous retiming of speech on speech perception in noise. Two types of anchor points were defined for retiming speech, namely syllable onsets and amplitude envelope peaks. For each anchor point type, retiming was implemented at two hierarchical levels, a slow time scale around 2.5 Hz and a fast time scale around 4 Hz. Results show that while any temporal distortion resulted in reduced speech intelligibility, isochronous speech anchored to P-centers (approximated by stressed syllable vowel onsets was significantly more intelligible than a matched anisochronous retiming, suggesting a facilitative role of periodicity defined on linguistically motivated units in processing speech in noise.
Lewis, Dawna E; Smith, Nicholas A; Spalding, Jody L; Valente, Daniel L
Visual information from talkers facilitates speech intelligibility for listeners when audibility is challenged by environmental noise and hearing loss. Less is known about how listeners actively process and attend to visual information from different talkers in complex multi-talker environments. This study tracked looking behavior in children with normal hearing (NH), mild bilateral hearing loss (MBHL), and unilateral hearing loss (UHL) in a complex multi-talker environment to examine the extent to which children look at talkers and whether looking patterns relate to performance on a speech-understanding task. It was hypothesized that performance would decrease as perceptual complexity increased and that children with hearing loss would perform more poorly than their peers with NH. Children with MBHL or UHL were expected to demonstrate greater attention to individual talkers during multi-talker exchanges, indicating that they were more likely to attempt to use visual information from talkers to assist in speech understanding in adverse acoustics. It also was of interest to examine whether MBHL, versus UHL, would differentially affect performance and looking behavior. Eighteen children with NH, eight children with MBHL, and 10 children with UHL participated (8-12 years). They followed audiovisual instructions for placing objects on a mat under three conditions: a single talker providing instructions via a video monitor, four possible talkers alternately providing instructions on separate monitors in front of the listener, and the same four talkers providing both target and nontarget information. Multi-talker background noise was presented at a 5 dB signal-to-noise ratio during testing. An eye tracker monitored looking behavior while children performed the experimental task. Behavioral task performance was higher for children with NH than for either group of children with hearing loss. There were no differences in performance between children with UHL and children
Full Text Available Recent technology provides us with realistic looking virtual characters. Motion capture and elaborate mathematical models supply data for natural looking, controllable facial and bodily animations. With the help of computational linguistics and artificial intelligence, we can automatically assign emotional categories to appropriate stretches of text for a simulation of those social scenarios where verbal communication is important. All this makes virtual characters a valuable tool for creation of versatile stimuli for research on the integration of emotion information from different modalities. We conducted an audio-visual experiment to investigate the differential contributions of emotional speech and facial expressions on emotion identification. We used recorded and synthesized speech as well as dynamic virtual faces, all enhanced for seven emotional categories. The participants were asked to recognize the prevalent emotion of paired faces and audio. Results showed that when the voice was recorded, the vocalized emotion influenced participants' emotion identification more than the facial expression. However, when the voice was synthesized, facial expression influenced participants' emotion identification more than vocalized emotion. Additionally, individuals did worse on identifying either the facial expression or vocalized emotion when the voice was synthesized. Our experimental method can help to determine how to improve synthesized emotional speech.
Thompson, Elaine C.; Carr, Kali Woodruff; White-Schwoch, Travis; Otto-Meyer, Sebastian; Kraus, Nina
From bustling classrooms to unruly lunchrooms, school settings are noisy. To learn effectively in the unwelcome company of numerous distractions, children must clearly perceive speech in noise. In older children and adults, speech-in-noise perception is supported by sensory and cognitive processes, but the correlates underlying this critical listening skill in young children (3–5 year olds) remain undetermined. Employing a longitudinal design (two evaluations separated by ~12 months), we followed a cohort of 59 preschoolers, ages 3.0–4.9, assessing word-in-noise perception, cognitive abilities (intelligence, short-term memory, attention), and neural responses to speech. Results reveal changes in word-in-noise perception parallel changes in processing of the fundamental frequency (F0), an acoustic cue known for playing a role central to speaker identification and auditory scene analysis. Four unique developmental trajectories (speech-in-noise perception groups) confirm this relationship, in that improvements and declines in word-in-noise perception couple with enhancements and diminishments of F0 encoding, respectively. Improvements in word-in-noise perception also pair with gains in attention. Word-in-noise perception does not relate to strength of neural harmonic representation or short-term memory. These findings reinforce previously-reported roles of F0 and attention in hearing speech in noise in older children and adults, and extend this relationship to preschool children. PMID:27864051
Grover, C.N.; Terken, J.M.B.
Modelling rhythmic characteristics of speech is expected to contribute to the acceptability of synthetic speech. However, before rules for the control of speech rhythm in synthetic speech can be developed, we need to know which properties of speech give rise to the perception of speech rhythm. An
This book interconnects two essential disciplines to study the perception of speech: Neuroscience and Quality of Experience, which to date have rarely been used together for the purposes of research on speech quality perception. In five key experiments, the book demonstrates the application of standard clinical methods in neurophysiology on the one hand, and of methods used in fields of research concerned with speech quality perception on the other. Using this combination, the book shows that speech stimuli with different lengths and different quality impairments are accompanied by physiological reactions related to quality variations, e.g., a positive peak in an event-related potential. Furthermore, it demonstrates that – in most cases – quality impairment intensity has an impact on the intensity of physiological reactions.
Full Text Available This fMRI study examines shared and distinct cortical areas involved in the auditory perception of song and speech at the level of their underlying constituents: words, pitch and rhythm. Univariate and multivariate analyses were performed on the brain activity patterns of six conditions, arranged in a subtractive hierarchy: sung sentences including words, pitch and rhythm; hummed speech prosody and song melody containing only pitch patterns and rhythm; as well as the pure musical or speech rhythm.Systematic contrasts between these balanced conditions following their hierarchical organization showed a great overlap between song and speech at all levels in the bilateral temporal lobe, but suggested a differential role of the inferior frontal gyrus (IFG and intraparietal sulcus (IPS in processing song and speech. The left IFG was involved in word- and pitch-related processing in speech, the right IFG in processing pitch in song.Furthermore, the IPS showed sensitivity to discrete pitch relations in song as opposed to the gliding pitch in speech. Finally, the superior temporal gyrus and premotor cortex coded for general differences between words and pitch patterns, irrespective of whether they were sung or spoken. Thus, song and speech share many features which are reflected in a fundamental similarity of brain areas involved in their perception. However, fine-grained acoustic differences on word and pitch level are reflected in the activity of IFG and IPS.
Moradi, Shahram; Wahlin, Anna; Hällgren, Mathias; Rönnberg, Jerker; Lidestam, Björn
This study aimed to examine the efficacy and maintenance of short-term (one-session) gated audiovisual speech training for improving auditory sentence identification in noise in experienced elderly hearing-aid users. Twenty-five hearing aid users (16 men and 9 women), with an average age of 70.8 years, were randomly divided into an experimental (audiovisual training, n = 14) and a control (auditory training, n = 11) group. Participants underwent gated speech identification tasks comprising Swedish consonants and words presented at 65 dB sound pressure level with a 0 dB signal-to-noise ratio (steady-state broadband noise), in audiovisual or auditory-only training conditions. The Hearing-in-Noise Test was employed to measure participants’ auditory sentence identification in noise before the training (pre-test), promptly after training (post-test), and 1 month after training (one-month follow-up). The results showed that audiovisual training improved auditory sentence identification in noise promptly after the training (post-test vs. pre-test scores); furthermore, this improvement was maintained 1 month after the training (one-month follow-up vs. pre-test scores). Such improvement was not observed in the control group, neither promptly after the training nor at the one-month follow-up. However, no significant between-groups difference nor an interaction between groups and session was observed. Conclusion: Audiovisual training may be considered in aural rehabilitation of hearing aid users to improve listening capabilities in noisy conditions. However, the lack of a significant between-groups effect (audiovisual vs. auditory) or an interaction between group and session calls for further research. PMID:28348542
Poeppel, David; Idsardi, William J; van Wassenhove, Virginie
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by the speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Pratt, Hillel; Bleich, Naomi; Mittelman, Nomi
Spatio-temporal distributions of cortical activity to audio-visual presentations of meaningless vowel-consonant-vowels and the effects of audio-visual congruence/incongruence, with emphasis on the McGurk effect, were studied. The McGurk effect occurs when a clearly audible syllable with one consonant, is presented simultaneously with a visual presentation of a face articulating a syllable with a different consonant and the resulting percept is a syllable with a consonant other than the auditorily presented one. Twenty subjects listened to pairs of audio-visually congruent or incongruent utterances and indicated whether pair members were the same or not. Source current densities of event-related potentials to the first utterance in the pair were estimated and effects of stimulus-response combinations, brain area, hemisphere, and clarity of visual articulation were assessed. Auditory cortex, superior parietal cortex, and middle temporal cortex were the most consistently involved areas across experimental conditions. Early (visual cortex. Clarity of visual articulation impacted activity in secondary visual cortex and Wernicke's area. McGurk perception was associated with decreased activity in primary and secondary auditory cortices and Wernicke's area before 100 msec, increased activity around 100 msec which decreased again around 180 msec. Activity in Broca's area was unaffected by McGurk perception and was only increased to congruent audio-visual stimuli 30-70 msec following consonant onset. The results suggest left hemisphere prominence in the effects of stimulus and response conditions on eight brain areas involved in dynamically distributed parallel processing of audio-visual integration. Initially (30-70 msec) subcortical contributions to auditory cortex, superior parietal cortex, and middle temporal cortex occur. During 100-140 msec, peristriate visual influences and Wernicke's area join in the processing. Resolution of incongruent audio-visual inputs is then
Jiang, Jintao; Bernstein, Lynne E.
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses, auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, the presented acoustic signal versus the acoustic signal recorded with the presented video, and the correlation between the presented acoustic and video stimuli. In Experiment 1, open-set perceptual responses were obtained for acoustic /bA/ or /lA/ dubbed to video /bA, dA, gA, vA, zA, lA, wA, ΔA/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions between 17% and 52%. That audiovisual stimulus relationships can account for response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships. PMID:21574741
Campos-Sánchez, Antonio; López-Núñez, Juan-Antonio; Scionti, Giuseppe; Garzón, Ingrid; González-Andrades, Miguel; Alaminos, Miguel; Sola, Tomás
Videos can be used as didactic tools for self-learning under several circumstances, including those cases in which students are responsible for the development of this resource as an audiovisual notebook. We compared students' and teachers' perceptions regarding the main features that an audiovisual notebook should include. Four questionnaires with items about information, images, text and music, and filmmaking were used to investigate students' (n = 115) and teachers' perceptions (n = 28) regarding the development of a video focused on a histological technique. The results show that both students and teachers significantly prioritize informative components, images and filmmaking more than text and music. The scores were significantly higher for teachers than for students for all four components analyzed. The highest scores were given to items related to practical and medically oriented elements, and the lowest values were given to theoretical and complementary elements. For most items, there were no differences between genders. A strong positive correlation was found between the scores given to each item by teachers and students. These results show that both students' and teachers' perceptions tend to coincide for most items, and suggest that audiovisual notebooks developed by students would emphasize the same items as those perceived by teachers to be the most relevant. Further, these findings suggest that the use of video as an audiovisual learning notebook would not only preserve the curricular objectives but would also offer the advantages of self-learning processes. © 2013 American Association of Anatomists.
Moulin-Frier, Clément; Arbib, Michael A
The motor theory of speech perception holds that we perceive the speech of another in terms of a motor representation of that speech. However, when we have learned to recognize a foreign accent, it seems plausible that recognition of a word rarely involves reconstruction of the speech gestures of the speaker rather than the listener. To better assess the motor theory and this observation, we proceed in three stages. Part 1 places the motor theory of speech perception in a larger framework based on our earlier models of the adaptive formation of mirror neurons for grasping, and for viewing extensions of that mirror system as part of a larger system for neuro-linguistic processing, augmented by the present consideration of recognizing speech in a novel accent. Part 2 then offers a novel computational model of how a listener comes to understand the speech of someone speaking the listener's native language with a foreign accent. The core tenet of the model is that the listener uses hypotheses about the word the speaker is currently uttering to update probabilities linking the sound produced by the speaker to phonemes in the native language repertoire of the listener. This, on average, improves the recognition of later words. This model is neutral regarding the nature of the representations it uses (motor vs. auditory). It serve as a reference point for the discussion in Part 3, which proposes a dual-stream neuro-linguistic architecture to revisits claims for and against the motor theory of speech perception and the relevance of mirror neurons, and extracts some implications for the reframing of the motor theory.
Dornan, Dimity; Hickson, Louise; Murdoch, Bruce; Houston, Todd
This study examined the speech perception, speech, and language developmental progress of 25 children with hearing loss (mean Pure-Tone Average [PTA] 79.37 dB HL) in an auditory verbal therapy program. Children were tested initially and then 21 months later on a battery of assessments. The speech and language results over time were compared with…
Allard, Emily R.; Williams, Dale F.
Using semantic differential scales with nine trait pairs, 445 adults rated five audio-taped speech samples, one depicting an individual without a disorder and four portraying communication disorders. Statistical analyses indicated that the no disorder sample was rated higher with respect to the trait of employability than were the articulation,…
Wong, Patrick C. M.; Uppunda, Ajith K.; Parrish, Todd B.; Dhar, Sumitrajit
Purpose: The present study examines the brain basis of listening to spoken words in noise, which is a ubiquitous characteristic of communication, with the focus on the dorsal auditory pathway. Method: English-speaking young adults identified single words in 3 listening conditions while their hemodynamic response was measured using fMRI: speech in…
Accounts of speech perception disagree on whether listeners perceive the acoustic signal (Diehl, Lotto, & Holt, 2004) or the vocal tract gestures that produce the signal (e.g., Fowler, 1986). In this dissertation, I outline a research program using a phenomenon called "perceptual compensation for coarticulation" (Mann, 1980) to examine this…
Sheppard, Beth; Elliott, Nancy; Baese-Berk, Melissa
Intensive English Program (IEP) Instructors and content faculty both listen to international students at the university. For these two groups of instructors, this study compared perceptions of international student speech by collecting comprehensibility ratings and transcription samples for intelligibility scores. No significant differences were…
Klein, Kelsey E; Walker, Elizabeth A; Kirby, Benjamin; McCreery, Ryan W
We examined the effects of vocabulary, lexical characteristics (age of acquisition and phonotactic probability), and auditory access (aided audibility and daily hearing aid [HA] use) on speech perception skills in children with HAs. Participants included 24 children with HAs and 25 children with normal hearing (NH), ages 5-12 years. Groups were matched on age, expressive and receptive vocabulary, articulation, and nonverbal working memory. Participants repeated monosyllabic words and nonwords in noise. Stimuli varied on age of acquisition, lexical frequency, and phonotactic probability. Performance in each condition was measured by the signal-to-noise ratio at which the child could accurately repeat 50% of the stimuli. Children from both groups with larger vocabularies showed better performance than children with smaller vocabularies on nonwords and late-acquired words but not early-acquired words. Overall, children with HAs showed poorer performance than children with NH. Auditory access was not associated with speech perception for the children with HAs. Children with HAs show deficits in sensitivity to phonological structure but appear to take advantage of vocabulary skills to support speech perception in the same way as children with NH. Further investigation is needed to understand the causes of the gap that exists between the overall speech perception abilities of children with HAs and children with NH.
Ji Woon Jeong
Full Text Available Schizotypy refers to the personality trait of experiencing “psychotic” symptoms and can be regarded as a predisposition of schizophrenia-spectrum psychopathology (Raine, 1991. Cumulative evidence has revealed that individuals with schizotypy, as well as schizophrenia patients, have emotional processing deficits. In the present study, we investigated multimodal emotion perception in schizotypy and implemented the machine learning technique to find out whether a schizotypy group (ST is distinguishable from a control group (NC, using electroencephalogram (EEG signals. Forty-five subjects (30 ST and 15 NC were divided into two groups based on their scores on a Schizotypal Personality Questionnaire. All participants performed an audiovisual emotion perception test while EEG was recorded. After the preprocessing stage, the discriminatory features were extracted using a mean subsampling technique. For an accurate estimation of covariance matrices, the shrinkage linear discriminant algorithm was used. The classification attained over 98% accuracy and zero rate of false-positive results. This method may have important clinical implications in discriminating those among the general population who have a subtle risk for schizotypy, requiring intervention in advance.
Liu, Fang; Jiang, Cunmei; Wang, Bei; Xu, Yi; Patel, Aniruddh D
This study investigated the underlying link between speech and music by examining whether and to what extent congenital amusia, a musical disorder characterized by degraded pitch processing, would impact spoken sentence comprehension for speakers of Mandarin, a tone language. Sixteen Mandarin-speaking amusics and 16 matched controls were tested on the intelligibility of news-like Mandarin sentences with natural and flat fundamental frequency (F0) contours (created via speech resynthesis) under four signal-to-noise (SNR) conditions (no noise, +5, 0, and -5dB SNR). While speech intelligibility in quiet and extremely noisy conditions (SNR=-5dB) was not significantly compromised by flattened F0, both amusic and control groups achieved better performance with natural-F0 sentences than flat-F0 sentences under moderately noisy conditions (SNR=+5 and 0dB). Relative to normal listeners, amusics demonstrated reduced speech intelligibility in both quiet and noise, regardless of whether the F0 contours of the sentences were natural or flattened. This deficit in speech intelligibility was not associated with impaired pitch perception in amusia. These findings provide evidence for impaired speech comprehension in congenital amusia, suggesting that the deficit of amusics extends beyond pitch processing and includes segmental processing. Copyright © 2014 Elsevier Ltd. All rights reserved.
Moore, Brian C J; Tyler, Lorraine K; Marslen-Wilson, William
Spoken language communication is arguably the most important activity that distinguishes humans from non-human species. This paper provides an overview of the review papers that make up this theme issue on the processes underlying speech communication. The volume includes contributions from researchers who specialize in a wide range of topics within the general area of speech perception and language processing. It also includes contributions from key researchers in neuroanatomy and functional neuro-imaging, in an effort to cut across traditional disciplinary boundaries and foster cross-disciplinary interactions in this important and rapidly developing area of the biological and cognitive sciences.
in cross-language and second language speech perception research: The mapping issue (the perceptual relationship of sounds of the native and the nonnative language in the mind of the native listener and the L2 learner), the perceptual and learning difficulty/ease issue (how this relationship may or may...... not cause perceptual and learning difficulty), and the plasticity issue (whether and how experience with the nonnative language affects the perceptual organization of speech sounds in the mind of L2 learners). One important general conclusion from this research is that perceptual learning is possible at all...
Preisig, Basil C; Eggenberger, Noëmi; Zito, Giuseppe; Vanbellingen, Tim; Schumacher, Rahel; Hopfner, Simone; Nyffeler, Thomas; Gutbrod, Klemens; Annoni, Jean-Marie; Bohlhalter, Stephan; Müri, René M
Co-speech gestures are part of nonverbal communication during conversations. They either support the verbal message or provide the interlocutor with additional information. Furthermore, they prompt as nonverbal cues the cooperative process of turn taking. In the present study, we investigated the influence of co-speech gestures on the perception of dyadic dialogue in aphasic patients. In particular, we analysed the impact of co-speech gestures on gaze direction (towards speaker or listener) and fixation of body parts. We hypothesized that aphasic patients, who are restricted in verbal comprehension, adapt their visual exploration strategies. Sixteen aphasic patients and 23 healthy control subjects participated in the study. Visual exploration behaviour was measured by means of a contact-free infrared eye-tracker while subjects were watching videos depicting spontaneous dialogues between two individuals. Cumulative fixation duration and mean fixation duration were calculated for the factors co-speech gesture (present and absent), gaze direction (to the speaker or to the listener), and region of interest (ROI), including hands, face, and body. Both aphasic patients and healthy controls mainly fixated the speaker's face. We found a significant co-speech gesture × ROI interaction, indicating that the presence of a co-speech gesture encouraged subjects to look at the speaker. Further, there was a significant gaze direction × ROI × group interaction revealing that aphasic patients showed reduced cumulative fixation duration on the speaker's face compared to healthy controls. Co-speech gestures guide the observer's attention towards the speaker, the source of semantic input. It is discussed whether an underlying semantic processing deficit or a deficit to integrate audio-visual information may cause aphasic patients to explore less the speaker's face. Copyright © 2014 Elsevier Ltd. All rights reserved.
Borrie, Stephanie A; Lansford, Kaitlin L; Barrett, Tyson S
The perception of rhythm cues plays an important role in recognizing spoken language, especially in adverse listening conditions. Indeed, this has been shown to hold true even when the rhythm cues themselves are dysrhythmic. This study investigates whether expertise in rhythm perception provides a processing advantage for perception (initial intelligibility) and learning (intelligibility improvement) of naturally dysrhythmic speech, dysarthria. Fifty young adults with typical hearing participated in 3 key tests, including a rhythm perception test, a receptive vocabulary test, and a speech perception and learning test, with standard pretest, familiarization, and posttest phases. Initial intelligibility scores were calculated as the proportion of correct pretest words, while intelligibility improvement scores were calculated by subtracting this proportion from the proportion of correct posttest words. Rhythm perception scores predicted intelligibility improvement scores but not initial intelligibility. On the other hand, receptive vocabulary scores predicted initial intelligibility scores but not intelligibility improvement. Expertise in rhythm perception appears to provide an advantage for processing dysrhythmic speech, but a familiarization experience is required for the advantage to be realized. Findings are discussed in relation to the role of rhythm in speech processing and shed light on processing models that consider the consequence of rhythm abnormalities in dysarthria.
Full Text Available Audiovisual translation is now a well-established sub-discipline of Translation Studies (TS: a position that it has reached over the last twenty years or so. Italian scholars and professionals in the field have made a substantial contribution to this successful development, a brief overview of which will be given in the first part of this article, inevitably concentrating on dubbing in the Italian context. Special attention will be devoted to the question of target audience perception, an area where researchers in the University of Bologna at Forlì have excelled. The second part of the article applies the methodology followed by the above mentioned researchers in a case study of how Italian end users perceive the dubbed version of the British film The History Boys (2006, which contains a plethora of culture-specific verbal and visual references to the English education system. The aim of the study was to ascertain: a whether translation/adaptation allows the transmission in this admittedly constrained medium of all the intended culture-bound issues, only too well known to the source audience, and, if so, to what extent, and b whether the target audience respondents to the e-questionnaire used were aware that they were missing information. The linked, albeit controversial, issue of quality assessment will also be addressed.
Besson, Patricia; Richiardi, Jonas; Bourdin, Christophe; Bringoux, Lionel; Mestre, Daniel R; Vercher, Jean-Louis
Thanks to their different senses, human observers acquire multiple information coming from their environment. Complex cross-modal interactions occur during this perceptual process. This article proposes a framework to analyze and model these interactions through a rigorous and systematic data-driven process. This requires considering the general relationships between the physical events or factors involved in the process, not only in quantitative terms, but also in term of the influence of one factor on another. We use tools from information theory and probabilistic reasoning to derive relationships between the random variables of interest, where the central notion is that of conditional independence. Using mutual information analysis to guide the model elicitation process, a probabilistic causal model encoded as a Bayesian network is obtained. We exemplify the method by using data collected in an audio-visual localization task for human subjects, and we show that it yields a well-motivated model with good predictive ability. The model elicitation process offers new prospects for the investigation of the cognitive mechanisms of multisensory perception.
Schreitmüller, Stefan; Frenken, Miriam; Bentz, Lüder; Ortmann, Magdalene; Walger, Martin; Meister, Hartmut
Watching a talker's mouth is beneficial for speech reception (SR) in many communication settings, especially in noise and when hearing is impaired. Measures for audiovisual (AV) SR can be valuable in the framework of diagnosing or treating hearing disorders. This study addresses the lack of standardized methods in many languages for assessing lipreading, AV gain, and integration. A new method is validated that supplements a German speech audiometric test with visualizations of the synthetic articulation of an avatar that was used, for it is feasible to lip-sync auditory speech in a highly standardized way. Three hypotheses were formed according to the literature on AV SR that used live or filmed talkers. It was tested whether respective effects could be reproduced with synthetic articulation: (1) cochlear implant (CI) users have a higher visual-only SR than normal-hearing (NH) individuals, and younger individuals obtain higher lipreading scores than older persons. (2) Both CI and NH gain from presenting AV over unimodal (auditory or visual) sentences in noise. (3) Both CI and NH listeners efficiently integrate complementary auditory and visual speech features. In a controlled, cross-sectional study with 14 experienced CI users (mean age 47.4) and 14 NH individuals (mean age 46.3, similar broad age distribution), lipreading, AV gain, and integration of a German matrix sentence test were assessed. Visual speech stimuli were synthesized by the articulation of the Talking Head system "MASSY" (Modular Audiovisual Speech Synthesizer), which displayed standardized articulation with respect to the visibility of German phones. In line with the hypotheses and previous literature, CI users had a higher mean visual-only SR than NH individuals (CI, 38%; NH, 12%; p < 0.001). Age was correlated with lipreading such that within each group, younger individuals obtained higher visual-only scores than older persons (rCI = -0.54; p = 0.046; rNH = -0.78; p < 0.001). Both CI and NH
Full Text Available Listeners vary in their ability to understand speech in noisy environments. Hearing sensitivity, as measured by pure-tone audiometry, can only partly explain these results, and cognition has emerged as another key concept. Although cognition relates to speech perception, the exact nature of the relationship remains to be fully understood. This study investigates how different aspects of cognition, particularly working memory and attention, relate to speech intelligibility for various tests.Perceptual accuracy of speech perception represents just one aspect of functioning in a listening environment. Activity and participation limits imposed by hearing loss, in addition to the demands of a listening environment, are also important and may be better captured by self-report questionnaires. Understanding how speech perception relates to self-reported aspects of listening forms the second focus of the study.Forty-four listeners aged between 50-74 years with mild SNHL were tested on speech perception tests differing in complexity from low (phoneme discrimination in quiet, to medium (digit triplet perception in speech-shaped noise to high (sentence perception in modulated noise; cognitive tests of attention, memory, and nonverbal IQ; and self-report questionnaires of general health-related and hearing-specific quality of life.Hearing sensitivity and cognition related to intelligibility differently depending on the speech test: neither was important for phoneme discrimination, hearing sensitivity alone was important for digit triplet perception, and hearing and cognition together played a role in sentence perception. Self-reported aspects of auditory functioning were correlated with speech intelligibility to different degrees, with digit triplets in noise showing the richest pattern. The results suggest that intelligibility tests can vary in their auditory and cognitive demands and their sensitivity to the challenges that auditory environments pose on
Heinrich, Antje; Henshaw, Helen; Ferguson, Melanie A.
Listeners vary in their ability to understand speech in noisy environments. Hearing sensitivity, as measured by pure-tone audiometry, can only partly explain these results, and cognition has emerged as another key concept. Although cognition relates to speech perception, the exact nature of the relationship remains to be fully understood. This study investigates how different aspects of cognition, particularly working memory and attention, relate to speech intelligibility for various tests. Perceptual accuracy of speech perception represents just one aspect of functioning in a listening environment. Activity and participation limits imposed by hearing loss, in addition to the demands of a listening environment, are also important and may be better captured by self-report questionnaires. Understanding how speech perception relates to self-reported aspects of listening forms the second focus of the study. Forty-four listeners aged between 50 and 74 years with mild sensorineural hearing loss were tested on speech perception tests differing in complexity from low (phoneme discrimination in quiet), to medium (digit triplet perception in speech-shaped noise) to high (sentence perception in modulated noise); cognitive tests of attention, memory, and non-verbal intelligence quotient; and self-report questionnaires of general health-related and hearing-specific quality of life. Hearing sensitivity and cognition related to intelligibility differently depending on the speech test: neither was important for phoneme discrimination, hearing sensitivity alone was important for digit triplet perception, and hearing and cognition together played a role in sentence perception. Self-reported aspects of auditory functioning were correlated with speech intelligibility to different degrees, with digit triplets in noise showing the richest pattern. The results suggest that intelligibility tests can vary in their auditory and cognitive demands and their sensitivity to the challenges that
Won, Jong Ho; Moon, Il Joon; Jin, Sunhwa; Park, Heesung; Woo, Jihwan; Cho, Yang-Sun; Chung, Won-Ho; Hong, Sung Hwa
Spectrotemporal modulation (STM) detection performance was examined for cochlear implant (CI) users. The test involved discriminating between an unmodulated steady noise and a modulated stimulus. The modulated stimulus presents frequency modulation patterns that change in frequency over time. In order to examine STM detection performance for different modulation conditions, two different temporal modulation rates (5 and 10 Hz) and three different spectral modulation densities (0.5, 1.0, and 2.0 cycles/octave) were employed, producing a total 6 different STM stimulus conditions. In order to explore how electric hearing constrains STM sensitivity for CI users differently from acoustic hearing, normal-hearing (NH) and hearing-impaired (HI) listeners were also tested on the same tasks. STM detection performance was best in NH subjects, followed by HI subjects. On average, CI subjects showed poorest performance, but some CI subjects showed high levels of STM detection performance that was comparable to acoustic hearing. Significant correlations were found between STM detection performance and speech identification performance in quiet and in noise. In order to understand the relative contribution of spectral and temporal modulation cues to speech perception abilities for CI users, spectral and temporal modulation detection was performed separately and related to STM detection and speech perception performance. The results suggest that that slow spectral modulation rather than slow temporal modulation may be important for determining speech perception capabilities for CI users. Lastly, test-retest reliability for STM detection was good with no learning. The present study demonstrates that STM detection may be a useful tool to evaluate the ability of CI sound processing strategies to deliver clinically pertinent acoustic modulation information.
Stekelenburg, Jeroen J; Vroomen, Jean
In many natural audiovisual events (e.g., a clap of the two hands), the visual signal precedes the sound and thus allows observers to predict when, where, and which sound will occur. Previous studies have reported that there are distinct neural correlates of temporal (when) versus phonetic/semantic (which) content on audiovisual integration. Here we examined the effect of visual prediction of auditory location (where) in audiovisual biological motion stimuli by varying the spatial congruency between the auditory and visual parts. Visual stimuli were presented centrally, whereas auditory stimuli were presented either centrally or at 90° azimuth. Typical sub-additive amplitude reductions (AV - V audiovisual interaction was also found at 40-60 ms (P50) in the spatially congruent condition, while no effect of congruency was found on the suppression of the P2. This indicates that visual prediction of auditory location can be coded very early in auditory processing.
Krieger-Redwood, Katya; Gaskell, M Gareth; Lindsay, Shane; Jefferies, Elizabeth
Several accounts of speech perception propose that the areas involved in producing language are also involved in perceiving it. In line with this view, neuroimaging studies show activation of premotor cortex (PMC) during phoneme judgment tasks; however, there is debate about whether speech perception necessarily involves motor processes, across all task contexts, or whether the contribution of PMC is restricted to tasks requiring explicit phoneme awareness. Some aspects of speech processing, such as mapping sounds onto meaning, may proceed without the involvement of motor speech areas if PMC specifically contributes to the manipulation and categorical perception of phonemes. We applied TMS to three sites-PMC, posterior superior temporal gyrus, and occipital pole-and for the first time within the TMS literature, directly contrasted two speech perception tasks that required explicit phoneme decisions and mapping of speech sounds onto semantic categories, respectively. TMS to PMC disrupted explicit phonological judgments but not access to meaning for the same speech stimuli. TMS to two further sites confirmed that this pattern was site specific and did not reflect a generic difference in the susceptibility of our experimental tasks to TMS: stimulation of pSTG, a site involved in auditory processing, disrupted performance in both language tasks, whereas stimulation of occipital pole had no effect on performance in either task. These findings demonstrate that, although PMC is important for explicit phonological judgments, crucially, PMC is not necessary for mapping speech onto meanings.
Visual Cues Contribute Differentially to Audiovisual Perception of Consonants and Vowels in Improving Recognition and Reducing Cognitive Demands in Listeners With Hearing Impairment Using Hearing Aids.
Moradi, Shahram; Lidestam, Björn; Danielsson, Henrik; Ng, Elaine Hoi Ning; Rönnberg, Jerker
We sought to examine the contribution of visual cues in audiovisual identification of consonants and vowels-in terms of isolation points (the shortest time required for correct identification of a speech stimulus), accuracy, and cognitive demands-in listeners with hearing impairment using hearing aids. The study comprised 199 participants with hearing impairment (mean age = 61.1 years) with bilateral, symmetrical, mild-to-severe sensorineural hearing loss. Gated Swedish consonants and vowels were presented aurally and audiovisually to participants. Linear amplification was adjusted for each participant to assure audibility. The reading span test was used to measure participants' working memory capacity. Audiovisual presentation resulted in shortened isolation points and improved accuracy for consonants and vowels relative to auditory-only presentation. This benefit was more evident for consonants than vowels. In addition, correlations and subsequent analyses revealed that listeners with higher scores on the reading span test identified both consonants and vowels earlier in auditory-only presentation, but only vowels (not consonants) in audiovisual presentation. Consonants and vowels differed in terms of the benefits afforded from their associative visual cues, as indicated by the degree of audiovisual benefit and reduction in cognitive demands linked to the identification of consonants and vowels presented audiovisually.
Mens, L.H.M.; Berenstein, C.K.
OBJECTIVE: To study the effect of two multipolar electrode configurations on speech perception, pitch perception, and the intracochlear electrical field. STUDY DESIGN: Crossover design; within subject. SETTING: Tertiary referral center. PATIENTS: Eight experienced adult cochlear implant users.
Heinrich, Antje; Henshaw, Helen; Ferguson, Melanie A
Good speech perception and communication skills in everyday life are crucial for participation and well-being, and are therefore an overarching aim of auditory rehabilitation. Both behavioral and self-report measures can be used to assess these skills. However, correlations between behavioral and self-report speech perception measures are often low. One possible explanation is that there is a mismatch between the specific situations used in the assessment of these skills in each method, and a more careful matching across situations might improve consistency of results. The role that cognition plays in specific speech situations may also be important for understanding communication, as speech perception tests vary in their cognitive demands. In this study, the role of executive function, working memory (WM) and attention in behavioral and self-report measures of speech perception was investigated. Thirty existing hearing aid users with mild-to-moderate hearing loss aged between 50 and 74 years completed a behavioral test battery with speech perception tests ranging from phoneme discrimination in modulated noise (easy) to words in multi-talker babble (medium) and keyword perception in a carrier sentence against a distractor voice (difficult). In addition, a self-report measure of aided communication, residual disability from the Glasgow Hearing Aid Benefit Profile, was obtained. Correlations between speech perception tests and self-report measures were higher when specific speech situations across both were matched. Cognition correlated with behavioral speech perception test results but not with self-report. Only the most difficult speech perception test, keyword perception in a carrier sentence with a competing distractor voice, engaged executive functions in addition to WM. In conclusion, any relationship between behavioral and self-report speech perception is not mediated by a shared correlation with cognition.
Full Text Available In many natural audiovisual events (e.g., a clap of the two hands, the visual signal precedes the sound and thus allows observers to predict when, where, and which sound will occur. Previous studies have already reported that there are distinct neural correlates of temporal (when versus phonetic/semantic (which content on audiovisual integration. Here we examined the effect of visual prediction of auditory location (where in audiovisual biological motion stimuli by varying the spatial congruency between the auditory and visual part of the audiovisual stimulus. Visual stimuli were presented centrally, whereas auditory stimuli were presented either centrally or at 90° azimuth. Typical subadditive amplitude reductions (AV – V < A were found for the auditory N1 and P2 for spatially congruent and incongruent conditions. The new finding is that the N1 suppression was larger for spatially congruent stimuli. A very early audiovisual interaction was also found at 30-50 ms in the spatially congruent condition, while no effect of congruency was found on the suppression of the P2. This indicates that visual prediction of auditory location can be coded very early in auditory processing.
Full Text Available Audiovisual capture happens when information across modalities get fused into a coherent percept. Ambiguous multi-modal stimuli have the potential to be powerful tools to observe such effects. We used such stimuli made of temporally synchronized and spatially co-localized visual flashes and auditory tones. The flashes produced bistable apparent motion and the tones produced ambiguous streaming. We measured strong interferences between perceptual decisions in each modality, a case of audiovisual capture. However, does this mean that audiovisual capture occurs before bistable decision? We argue that this is not the case, as the interference had a slow temporal dynamics and was modulated by audiovisual congruence, suggestive of high-level factors such as attention or intention. We propose a framework to integrate bistability and audiovisual capture, which distinguishes between “what” competes and “how” it competes (Hupé et al., 2008. The audiovisual interactions may be the result of contextual influences on neural representations (“what” competes, quite independent from the causal mechanisms of perceptual switches (“how” it competes. This framework predicts that audiovisual capture can bias bistability especially if modalities are congruent (Sato et al., 2007, but that is fundamentally distinct in nature from the bistable competition mechanism.
Pisoni, David B.
This paper reviews some of the major evidence and arguments currently available to support the view that human speech perception may require the use of specialized neural mechanisms for perceptual analysis. Experiments using synthetically produced speech signals with adults are briefly summarized and extensions of these results to infants and other organisms are reviewed with an emphasis towards detailing those aspects of speech perception that may require some need for specialized species-specific processors. Finally, some comments on the role of early experience in perceptual development are provided as an attempt to identify promising areas of new research in speech perception. PMID:399200
Nisreen Naji Al-Khawaldeh
Full Text Available This paper presents the findings of an empirical study which compares Jordanian and English native speakers’ perceptions about the speech act of thanking. The forty interviews conducted revealed some similarities but also of remarkable cross-cultural differences relating to the significance of thanking, the variables affecting it, and the appropriate linguistic and paralinguistic choices, as well as their impact on the interpretation of thanking behaviour. The most important theoretical finding is that the data, while consistent with many views found in the existing literature, do not support Brown and Levinson’s (1987 claim that thanking is a speech act which intrinsically threatens the speaker’s negative face because it involves overt acceptance of an imposition on the speaker. Rather, thanking should be viewed as a means of establishing and sustaining social relationships. The study findings suggest that cultural variation in thanking is due to the high degree of sensitivity of this speech act to the complex interplay of a range of social and contextual variables, and point to some promising directions for further research.
Mealings, Kiri T.; Demuth, Katherine; Buchholz, Jörg; Dillon, Harvey
Purpose: Open-plan classroom styles are increasingly being adopted in Australia despite evidence that their high intrusive noise levels adversely affect learning. The aim of this study was to develop a new Australian speech perception task (the Mealings, Demuth, Dillon, and Buchholz Classroom Speech Perception Test) and use it in an open-plan…
Hickok, Gregory; Costanzo, Maddalena; Capasso, Rita; Miceli, Gabriele
Motor theories of speech perception have been re-vitalized as a consequence of the discovery of mirror neurons. Some authors have even promoted a strong version of the motor theory, arguing that the motor speech system is critical for perception. Part of the evidence that is cited in favor of this claim is the observation from the early 1980s that…
Eskelund, Kasper; MacDonald, Ewen; Andersen, Tobias
We perceive identity, expression and speech from faces. While perception of identity and expression depends crucially on the configuration of facial features it is less clear whether this holds for visual speech perception. Facial configuration is poorly perceived for upside-down faces as demonst...
Naturally produced English clear speech has been shown to be more intelligible than English conversational speech. However, little is known about the extent of the clear speech effects in the production of nonnative English, and perception of foreign-accented English by younger and older listeners. The present study examined whether Cantonese speakers would employ the same strategies as those used by native English speakers in producing clear speech in their second language. Also, the clear s...
Schellenberg, E Glenn
Claims of beneficial side effects of music training are made for many different abilities, including verbal and visuospatial abilities, executive functions, working memory, IQ, and speech perception in particular. Such claims assume that music training causes the associations even though children who take music lessons are likely to differ from other children in music aptitude, which is associated with many aspects of speech perception. Music training in childhood is also associated with cognitive, personality, and demographic variables, and it is well established that IQ and personality are determined largely by genetics. Recent evidence also indicates that the role of genetics in music aptitude and music achievement is much larger than previously thought. In short, music training is an ideal model for the study of gene-environment interactions but far less appropriate as a model for the study of plasticity. Children seek out environments, including those with music lessons, that are consistent with their predispositions; such environments exaggerate preexisting individual differences. © 2015 New York Academy of Sciences.
Full Text Available Human locomotion typically creates noise, a possible consequence of which is the masking of sound signals originating in the surroundings. When walking side by side, people often subconsciously synchronize their steps. The neurophysiological and evolutionary background of this behavior is unclear. The present study investigated the potential of sound created by walking to mask perception of speech and compared the masking produced by walking in step with that produced by unsynchronized walking. The masking sound (footsteps on gravel and the target sound (speech were presented through the same speaker to 15 normal-hearing subjects. The original recorded walking sound was modified to mimic the sound of two individuals walking in pace or walking out of synchrony. The participants were instructed to adjust the sound level of the target sound until they could just comprehend the speech signal ("just follow conversation" or JFC level when presented simultaneously with synchronized or unsynchronized walking sound at 40 dBA, 50 dBA, 60 dBA, or 70 dBA. Synchronized walking sounds produced slightly less masking of speech than did unsynchronized sound. The median JFC threshold in the synchronized condition was 38.5 dBA, while the corresponding value for the unsynchronized condition was 41.2 dBA. Combined results at all sound pressure levels showed an improvement in the signal-to-noise ratio (SNR for synchronized footsteps; the median difference was 2.7 dB and the mean difference was 1.2 dB [P < 0.001, repeated-measures analysis of variance (RM-ANOVA]. The difference was significant for masker levels of 50 dBA and 60 dBA, but not for 40 dBA or 70 dBA. This study provides evidence that synchronized walking may reduce the masking potential of footsteps.
Full Text Available Listeners must accomplish two complementary perceptual feats in extracting a message from speech. They must discriminate linguistically-relevant acoustic variability and generalize across irrelevant variability. Said another way, they must categorize speech. Since the mapping of acoustic variability is language-specific, these categories must be learned from experience. Thus, understanding how, in general, the auditory system acquires and represents categories can inform us about the toolbox of mechanisms available to speech perception. This perspective invites consideration of findings from cognitive neuroscience literatures outside of the speech domain as a means of constraining models of speech perception. Although neurobiological models of speech perception have mainly focused on cerebral cortex, research outside the speech domain is consistent with the possibility of significant subcortical contributions in category learning. Here, we review the functional role of one such structure, the basal ganglia. We examine research from animal electrophysiology, human neuroimaging, and behavior to consider characteristics of basal ganglia processing that may be advantageous for speech category learning. We also present emerging evidence for a direct role for basal ganglia in learning auditory categories in a complex, naturalistic task intended to model the incidental manner in which speech categories are acquired. To conclude, we highlight new research questions that arise in incorporating the broader neuroscience research literature in modeling speech perception, and suggest how understanding contributions of the basal ganglia can inform attempts to optimize training protocols for learning non-native speech categories in adulthood.
Dreschler, W. A.; Plomp, R.
Twenty-one sensorineurally hearing-impaired adolescents were studied with an extensive battery of tone-perception, phoneme-perception, and speech-perception tests. Tests on loudness perception, frequency selectivity, and temporal resolution at the test frequencies of 500, 1000, and 2000 Hz were
Nabelek, Anna K.; Tampas, Joanna W.; Burchfield, Samuel B.
l, speech perception in noiseBackground noise is a significant factor influencing hearing-aid satisfaction and is a major reason for rejection of hearing aids. Attempts have been made by previous researchers to relate the use of hearing aids to speech perception in noise (SPIN), with an expectation of improved speech perception followed by an…
Hakvoort, Britt; de Bree, Elise; van der Leij, Aryan; Maassen, Ben; van Setten, Ellie; Maurits, Natasha; van Zuijen, Titia L.
Purpose: This study assessed whether a categorical speech perception (CP) deficit is associated with dyslexia or familial risk for dyslexia, by exploring a possible cascading relation from speech perception to phonology to reading and by identifying whether speech perception distinguishes familial risk (FR) children with dyslexia (FRD) from those…
Uhler, Kristin M; Baca, Rosalinda; Dudas, Emily; Fredrickson, Tammy
Speech perception measures have long been considered an integral piece of the audiological assessment battery. Currently, a prelinguistic, standardized measure of speech perception is missing in the clinical assessment battery for infants and young toddlers. Such a measure would allow systematic assessment of speech perception abilities of infants as well as the potential to investigate the impact early identification of hearing loss and early fitting of amplification have on the auditory pathways. To investigate the impact of sensation level (SL) on the ability of infants with normal hearing (NH) to discriminate /a-i/ and /ba-da/ and to determine if performance on the two contrasts are significantly different in predicting the discrimination criterion. The design was based on a survival analysis model for event occurrence and a repeated measures logistic model for binary outcomes. The outcome for survival analysis was the minimum SL for criterion and the outcome for the logistic regression model was the presence/absence of achieving the criterion. Criterion achievement was designated when an infant's proportion correct score was >0.75 on the discrimination performance task. Twenty-two infants with NH sensitivity participated in this study. There were 9 males and 13 females, aged 6-14 mo. Testing took place over two to three sessions. The first session consisted of a hearing test, threshold assessment of the two speech sounds (/a/ and /i/), and if time and attention allowed, visual reinforcement infant speech discrimination (VRISD). The second session consisted of VRISD assessment for the two test contrasts (/a-i/ and /ba-da/). The presentation level started at 50 dBA. If the infant was unable to successfully achieve criterion (>0.75) at 50 dBA, the presentation level was increased to 70 dBA followed by 60 dBA. Data examination included an event analysis, which provided the probability of criterion distribution across SL. The second stage of the analysis was a
Yamamoto, Kosuke; Kawabata, Hideaki
We ordinarily speak fluently, even though our perceptions of our own voices are disrupted by various environmental acoustic properties. The underlying mechanism of speech is supposed to monitor the temporal relationship between speech production and the perception of auditory feedback, as suggested by a reduction in speech fluency when the speaker is exposed to delayed auditory feedback (DAF). While many studies have reported that DAF influences speech motor processing, its relationship to the temporal tuning effect on multimodal integration, or temporal recalibration, remains unclear. We investigated whether the temporal aspects of both speech perception and production change due to adaptation to the delay between the motor sensation and the auditory feedback. This is a well-used method of inducing temporal recalibration. Participants continually read texts with specific DAF times in order to adapt to the delay. Then, they judged the simultaneity between the motor sensation and the vocal feedback. We measured the rates of speech with which participants read the texts in both the exposure and re-exposure phases. We found that exposure to DAF changed both the rate of speech and the simultaneity judgment, that is, participants' speech gained fluency. Although we also found that a delay of 200 ms appeared to be most effective in decreasing the rates of speech and shifting the distribution on the simultaneity judgment, there was no correlation between these measurements. These findings suggest that both speech motor production and multimodal perception are adaptive to temporal lag but are processed in distinct ways.
Cholinergic Potentiation and Audiovisual Repetition-Imitation Therapy Improve Speech Production and Communication Deficits in a Person with Crossed Aphasia by Inducing Structural Plasticity in White Matter Tracts.
Berthier, Marcelo L; De-Torres, Irene; Paredes-Pacheco, José; Roé-Vellvé, Núria; Thurnhofer-Hemsi, Karl; Torres-Prioris, María J; Alfaro, Francisco; Moreno-Torres, Ignacio; López-Barroso, Diana; Dávila, Guadalupe
Donepezil (DP), a cognitive-enhancing drug targeting the cholinergic system, combined with massed sentence repetition training augmented and speeded up recovery of speech production deficits in patients with chronic conduction aphasia and extensive left hemisphere infarctions (Berthier et al., 2014). Nevertheless, a still unsettled question is whether such improvements correlate with restorative structural changes in gray matter and white matter pathways mediating speech production. In the present study, we used pharmacological magnetic resonance imaging to study treatment-induced brain changes in gray matter and white matter tracts in a right-handed male with chronic conduction aphasia and a right subcortical lesion (crossed aphasia). A single-patient, open-label multiple-baseline design incorporating two different treatments and two post-treatment evaluations was used. The patient received an initial dose of DP (5 mg/day) which was maintained during 4 weeks and then titrated up to 10 mg/day and administered alone (without aphasia therapy) during 8 weeks (Endpoint 1). Thereafter, the drug was combined with an audiovisual repetition-imitation therapy (Look-Listen-Repeat, LLR) during 3 months (Endpoint 2). Language evaluations, diffusion weighted imaging (DWI), and voxel-based morphometry (VBM) were performed at baseline and at both endpoints in JAM and once in 21 healthy control males. Treatment with DP alone and combined with LLR therapy induced marked improvement in aphasia and communication deficits as well as in selected measures of connected speech production, and phrase repetition. The obtained gains in speech production remained well-above baseline scores even 4 months after ending combined therapy. Longitudinal DWI showed structural plasticity in the right frontal aslant tract and direct segment of the arcuate fasciculus with both interventions. VBM revealed no structural changes in other white matter tracts nor in cortical areas linked by these tracts. In
Full Text Available Developmental stuttering is a speech disorder that disrupts the ability to produce speech fluently. While stuttering is typically diagnosed based on one's behavior during speech production, some models suggest that it involves more central representations of language, and thus may affect language perception as well. Here we tested the hypothesis that developmental stuttering implicates neural systems involved in language perception, in a task that manipulates comprehensibility without an overt speech production component. We used functional magnetic resonance imaging to measure blood oxygenation level dependent (BOLD signals in adults who do and do not stutter, while they were engaged in an incidental speech perception task. We found that speech perception evokes stronger activation in adults who stutter (AWS compared to controls, specifically in the right inferior frontal gyrus (RIFG and in left Heschl's gyrus (LHG. Significant differences were additionally found in the lateralization of response in the inferior frontal cortex: AWS showed bilateral inferior frontal activity, while controls showed a left lateralized pattern of activation. These findings suggest that developmental stuttering is associated with an imbalanced neural network for speech processing, which is not limited to speech production, but also affects cortical responses during speech perception.
Van Engen, Kristin J; Xie, Zilong; Chandrasekaran, Bharath
In noisy situations, visual information plays a critical role in the success of speech communication: listeners are better able to understand speech when they can see the speaker. Visual influence on auditory speech perception is also observed in the McGurk effect, in which discrepant visual information alters listeners' auditory perception of a spoken syllable. When hearing /ba/ while seeing a person saying /ga/, for example, listeners may report hearing /da/. Because these two phenomena have been assumed to arise from a common integration mechanism, the McGurk effect has often been used as a measure of audiovisual integration in speech perception. In this study, we test whether this assumed relationship exists within individual listeners. We measured participants' susceptibility to the McGurk illusion as well as their ability to identify sentences in noise across a range of signal-to-noise ratios in audio-only and audiovisual modalities. Our results do not show a relationship between listeners' McGurk susceptibility and their ability to use visual cues to understand spoken sentences in noise, suggesting that McGurk susceptibility may not be a valid measure of audiovisual integration in everyday speech processing.
Kirsten E Smayda
Full Text Available Speech perception is critical to everyday life. Oftentimes noise can degrade a speech signal; however, because of the cues available to the listener, such as visual and semantic cues, noise rarely prevents conversations from continuing. The interaction of visual and semantic cues in aiding speech perception has been studied in young adults, but the extent to which these two cues interact for older adults has not been studied. To investigate the effect of visual and semantic cues on speech perception in older and younger adults, we recruited forty-five young adults (ages 18-35 and thirty-three older adults (ages 60-90 to participate in a speech perception task. Participants were presented with semantically meaningful and anomalous sentences in audio-only and audio-visual conditions. We hypothesized that young adults would outperform older adults across SNRs, modalities, and semantic contexts. In addition, we hypothesized that both young and older adults would receive a greater benefit from a semantically meaningful context in the audio-visual relative to audio-only modality. We predicted that young adults would receive greater visual benefit in semantically meaningful contexts relative to anomalous contexts. However, we predicted that older adults could receive a greater visual benefit in either semantically meaningful or anomalous contexts. Results suggested that in the most supportive context, that is, semantically meaningful sentences presented in the audiovisual modality, older adults performed similarly to young adults. In addition, both groups received the same amount of visual and meaningful benefit. Lastly, across groups, a semantically meaningful context provided more benefit in the audio-visual modality relative to the audio-only modality, and the presence of visual cues provided more benefit in semantically meaningful contexts relative to anomalous contexts. These results suggest that older adults can perceive speech as well as younger
Full Text Available This study investigated whether auditory, speech perception and phonological skills are tightly interrelated or independently contributing to reading. We assessed each of these three skills in 36 adults with a past diagnosis of dyslexia and 54 matched normal reading adults. Phonological skills were tested by the typical threefold tasks, i.e. rapid automatic naming, verbal short term memory and phonological awareness. Dynamic auditory processing skills were assessed by means of a frequency modulation (FM and an amplitude rise time (RT; an intensity discrimination task (ID was included as a non-dynamic control task. Speech perception was assessed by means of sentences and words in noise tasks. Group analysis revealed significant group differences in auditory tasks (i.e. RT and ID and in phonological processing measures, yet no differences were found for speech perception. In addition, performance on RT discrimination correlated with reading but this relation was mediated by phonological processing and not by speech in noise. Finally, inspection of the individual scores revealed that the dyslexic readers showed an increased proportion of deviant subjects on the slow-dynamic auditory and phonological tasks, yet each individual dyslexic reader does not display a clear pattern of deficiencies across the levels of processing skills. Although our results support phonological and slow-rate dynamic auditory deficits which relate to literacy, they suggest that at the individual level, problems in reading and writing cannot be explained by the cascading auditory theory. Instead, dyslexic adults seem to vary considerably in the extent to which each of the auditory and phonological factors are expressed and interact with environmental and higher-order cognitive influences.
Full Text Available The neural basis of speech perception has been debated for over a century. While it is generally agreed that the superior temporal lobes are critical for the perceptual analysis of speech, a major current topic is whether the motor system contributes to speech perception, with several conflicting findings attested. In a dorsal-ventral speech stream framework (Hickok & Poeppel 2007, this debate is essentially about the roles of the dorsal versus ventral speech processing streams. A major roadblock in characterizing the neuroanatomy of speech perception is task-specific effects. For example, much of the evidence for dorsal stream involvement comes from syllable discrimination type tasks, which have been found to behaviorally doubly dissociate from auditory comprehension tasks (Baker et al. 1981. Discrimination task deficits could be a result of difficulty perceiving the sounds themselves, which is the typical assumption, or it could be a result of failures in temporary maintenance of the sensory traces, or the comparison and/or the decision process. Similar complications arise in perceiving sentences: the extent of inferior frontal (i.e. dorsal stream activation during listening to sentences increases as a function of increased task demands (Love et al. 2006. Another complication is the stimulus: much evidence for dorsal stream involvement uses speech samples lacking semantic context (CVs, non-words. The present study addresses these issues in a large-scale lesion-symptom mapping study. 158 patients with focal cerebral lesions from the Mutli-site Aphasia Research Consortium underwent a structural MRI or CT scan, as well as an extensive psycholinguistic battery. Voxel-based lesion symptom mapping was used to compare the neuroanatomy involved in the following speech perception tasks with varying phonological, semantic, and task loads: (i two discrimination tasks of syllables (non-words and words, respectively, (ii two auditory comprehension tasks
Ben-David, Boaz M.; Multani, Namita; Shakuf, Vered; Rudzicz, Frank; van Lieshout, Pascal H. H. M.
Purpose: Our aim is to explore the complex interplay of prosody (tone of speech) and semantics (verbal content) in the perception of discrete emotions in speech. Method: We implement a novel tool, the Test for Rating of Emotions in Speech. Eighty native English speakers were presented with spoken sentences made of different combinations of 5…
Viswanathan, Navin; Magnuson, James S.; Fowler, Carol A.
According to one approach to speech perception, listeners perceive speech by applying general pattern matching mechanisms to the acoustic signal (e.g., Diehl, Lotto, & Holt, 2004). An alternative is that listeners perceive the phonetic gestures that structured the acoustic signal (e.g., Fowler, 1986). The two accounts have offered different…
Lavie, Limor; Banai, Karen; Karni, Avi; Attias, Joseph
Purpose: We tested whether using hearing aids can improve unaided performance in speech perception tasks in older adults with hearing impairment. Method: Unaided performance was evaluated in dichotic listening and speech-in-noise tests in 47 older adults with hearing impairment; 36 participants in 3 study groups were tested before hearing aid…
Verheij, Emmy; Oomen, Karin P Q; Smetsers, Stephanie E; van Zanten, Gijsbert A; Speleman, Lucienne
Fanconi anemia is a hereditary chromosomal instability disorder. Hearing loss and ear abnormalities are among the many manifestations reported in this disorder. In addition, Fanconi anemia patients often complain about hearing difficulties in situations with background noise (speech perception in noise difficulties). Our study aimed to describe the prevalence of hearing loss and speech perception in noise difficulties in Dutch Fanconi anemia patients. Retrospective chart review. A retrospective chart review was conducted at a Dutch tertiary care center. All patients with Fanconi anemia at clinical follow-up in our hospital were included. Medical files were reviewed to collect data on hearing loss and speech perception in noise difficulties. In total, 49 Fanconi anemia patients were included. Audiograms were available in 29 patients and showed hearing loss in 16 patients (55%). Conductive hearing loss was present in 24.1%, sensorineural in 20.7%, and mixed in 10.3%. A speech in noise test was performed in 17 patients; speech perception in noise was subnormal in nine patients (52.9%) and abnormal in two patients (11.7%). Hearing loss and speech perception in noise abnormalities are common in Fanconi anemia. Therefore, pure tone audiograms and speech in noise tests should be performed, preferably already at a young age, because hearing aids or assistive listening devices could be very valuable in developing language and communication skills. 4. Laryngoscope, 127:2358-2361, 2017. © 2016 The American Laryngological, Rhinological and Otological Society, Inc.
Davidson, Lisa S; Geers, Ann E; Blamey, Peter J; Tobey, Emily A; Brenner, Christine A
The objectives of this report are to (1) describe the speech perception abilities of long-term pediatric cochlear implant (CI) recipients by comparing scores obtained at elementary school (CI-E, 8 to 9 yrs) with scores obtained at high school (CI-HS, 15 to 18 yrs); (2) evaluate speech perception abilities in demanding listening conditions (i.e., noise and lower intensity levels) at adolescence; and (3) examine the relation of speech perception scores to speech and language development over this longitudinal timeframe. All 112 teenagers were part of a previous nationwide study of 8- and 9-yr-olds (N = 181) who received a CI between 2 and 5 yrs of age. The test battery included (1) the Lexical Neighborhood Test (LNT; hard and easy word lists); (2) the Bamford Kowal Bench sentence test; (3) the Children's Auditory-Visual Enhancement Test; (4) the Test of Auditory Comprehension of Language at CI-E; (5) the Peabody Picture Vocabulary Test at CI-HS; and (6) the McGarr sentences (consonants correct) at CI-E and CI-HS. CI-HS speech perception was measured in both optimal and demanding listening conditions (i.e., background noise and low-intensity level). Speech perception scores were compared based on age at test, lexical difficulty of stimuli, listening environment (optimal and demanding), input mode (visual and auditory-visual), and language age. All group mean scores significantly increased with age across the two test sessions. Scores of adolescents significantly decreased in demanding listening conditions. The effect of lexical difficulty on the LNT scores, as evidenced by the difference in performance between easy versus hard lists, increased with age and decreased for adolescents in challenging listening conditions. Calculated curves for percent correct speech perception scores (LNT and Bamford Kowal Bench) and consonants correct on the McGarr sentences plotted against age-equivalent language scores on the Test of Auditory Comprehension of Language and Peabody
Cholinergic Potentiation and Audiovisual Repetition-Imitation Therapy Improve Speech Production and Communication Deficits in a Person with Crossed Aphasia by Inducing Structural Plasticity in White Matter Tracts
Marcelo L. Berthier
Full Text Available Donepezil (DP, a cognitive-enhancing drug targeting the cholinergic system, combined with massed sentence repetition training augmented and speeded up recovery of speech production deficits in patients with chronic conduction aphasia and extensive left hemisphere infarctions (Berthier et al., 2014. Nevertheless, a still unsettled question is whether such improvements correlate with restorative structural changes in gray matter and white matter pathways mediating speech production. In the present study, we used pharmacological magnetic resonance imaging to study treatment-induced brain changes in gray matter and white matter tracts in a right-handed male with chronic conduction aphasia and a right subcortical lesion (crossed aphasia. A single-patient, open-label multiple-baseline design incorporating two different treatments and two post-treatment evaluations was used. The patient received an initial dose of DP (5 mg/day which was maintained during 4 weeks and then titrated up to 10 mg/day and administered alone (without aphasia therapy during 8 weeks (Endpoint 1. Thereafter, the drug was combined with an audiovisual repetition-imitation therapy (Look-Listen-Repeat, LLR during 3 months (Endpoint 2. Language evaluations, diffusion weighted imaging (DWI, and voxel-based morphometry (VBM were performed at baseline and at both endpoints in JAM and once in 21 healthy control males. Treatment with DP alone and combined with LLR therapy induced marked improvement in aphasia and communication deficits as well as in selected measures of connected speech production, and phrase repetition. The obtained gains in speech production remained well-above baseline scores even 4 months after ending combined therapy. Longitudinal DWI showed structural plasticity in the right frontal aslant tract and direct segment of the arcuate fasciculus with both interventions. VBM revealed no structural changes in other white matter tracts nor in cortical areas linked by these
Full Text Available Background: Cochlear implants (CI for the rehabilitation of patients with profound or total bilateral sensorineural hypoacusis represent the initial use of electrical fields to provide audibility in cases where the use of sound amplifiers does not provide satisfactory results. Aims: To compare speech perception performance after cochlear implantation in children with connexin 26-associated deafness with that of a control group of children with deafness of unknown etiology. Study Design: Retrospective comparative study. Methods: During the period from 2006 to , cochlear implantation was performed on 26 children. Eighteen of these children had undergone genetic tests for mutation of the Gap Junction Protein Beta 2 (GJB2 gene. Bi-allelic GJB2 mutations were confirmed in 7 out of 18 examined children. In order to confirm whether genetic factors have influence on speech perception after cochlear implantation, we compared the post-implantation speech performance of seven children with mutations of the GBJ2 (connexin 26 gene with seven other children who had the wild type version of this particular gene. The latter were carefully matched according to the age at cochlear implantation. Speech perception performance was measured before cochlear implantation, and one and two years after implantation. All the patients were arranged in line with the appropriate speech perception category (SPC. Non-parametric tests, Friedman ANOVA and Mann-Whitney’s U test were used for statistical analysis. Results: Both groups showed similar improvements in speech perception scores after cochlear implantation. Statistical analysis did not confirm significant differences between the groups 12 and 24 months after cochlear implantation. Conclusion: The results obtained in this study showed an absence of apparent distinctions in the scores of speech perception between the two examined groups and therefore might have significant implications in selecting prognostic indicators
Xi, Ji; Liang, Ruiyu; Fei, Xianju
In this paper, a speech emotion recognition (SER) algorithm was proposed to improve the emotional perception of hearing-impaired people. The algorithm utilizes multiple kernel technology to overcome the drawback of SVM: slow training speed. Firstly, in order to improve the adaptive performance of Gaussian Radial Basis Function (RBF), the parameter determining the nonlinear mapping was optimized on the basis of Kernel target alignment. Then, the obtained Kernel Function was used as the basis kernel of Multiple Kernel Learning (MKL) with slack variable that could solve the over-fitting problem. However, the slack variable also brings the error into the result. Therefore, a soft-margin MKL was proposed to balance the margin against the error. Moreover, the relatively iterative algorithm was used to solve the combination coefficients and hyper-plane equations. Experimental results show that the proposed algorithm can acquire an accuracy of 90% for five kinds of emotions including happiness, sadness, anger, fear and neutral. Compared with KPCA+CCA and PIM-FSVM, the proposed algorithm has the highest accuracy.
Song, Jae-Jin; Lee, Hyo-Jeong; Kang, Hyejin; Lee, Dong Soo; Chang, Sun O; Oh, Seung Ha
While deafness-induced plasticity has been investigated in the visual and auditory domains, not much is known about language processing in audiovisual multimodal environments for patients with restored hearing via cochlear implant (CI) devices. Here, we examined the effect of agreeing or conflicting visual inputs on auditory processing in deaf patients equipped with degraded artificial hearing. Ten post-lingually deafened CI users with good performance, along with matched control subjects, underwent H 2 (15) O-positron emission tomography scans while carrying out a behavioral task requiring the extraction of speech information from unimodal auditory stimuli, bimodal audiovisual congruent stimuli, and incongruent stimuli. Regardless of congruency, the control subjects demonstrated activation of the auditory and visual sensory cortices, as well as the superior temporal sulcus, the classical multisensory integration area, indicating a bottom-up multisensory processing strategy. Compared to CI users, the control subjects exhibited activation of the right ventral premotor-supramarginal pathway. In contrast, CI users activated primarily the visual cortices more in the congruent audiovisual condition than in the null condition. In addition, compared to controls, CI users displayed an activation focus in the right amygdala for congruent audiovisual stimuli. The most notable difference between the two groups was an activation focus in the left inferior frontal gyrus in CI users confronted with incongruent audiovisual stimuli, suggesting top-down cognitive modulation for audiovisual conflict. Correlation analysis revealed that good speech performance was positively correlated with right amygdala activity for the congruent condition, but negatively correlated with bilateral visual cortices regardless of congruency. Taken together these results suggest that for multimodal inputs, cochlear implant users are more vision-reliant when processing congruent stimuli and are disturbed
Ingvalson, Erin M; Dhar, Sumitrajit; Wong, Patrick C M; Liu, Hanjun
Working memory capacity has been linked to performance on many higher cognitive tasks, including the ability to perceive speech in noise. Current efforts to train working memory have demonstrated that working memory performance can be improved, suggesting that working memory training may lead to improved speech perception in noise. A further advantage of working memory training to improve speech perception in noise is that working memory training materials are often simple, such as letters or digits, making them easily translatable across languages. The current effort tested the hypothesis that working memory training would be associated with improved speech perception in noise and that materials would easily translate across languages. Native Mandarin Chinese and native English speakers completed ten days of reversed digit span training. Reading span and speech perception in noise both significantly improved following training, whereas untrained controls showed no gains. These data suggest that working memory training may be used to improve listeners' speech perception in noise and that the materials may be quickly adapted to a wide variety of listeners.
Eg, Ragnhild; Behne, Dawn M
In well-controlled laboratory experiments, researchers have found that humans can perceive delays between auditory and visual signals as short as 20 ms. Conversely, other experiments have shown that humans can tolerate audiovisual asynchrony that exceeds 200 ms. This seeming contradiction in human temporal sensitivity can be attributed to a number of factors such as experimental approaches and precedence of the asynchronous signals, along with the nature, duration, location, complexity and repetitiveness of the audiovisual stimuli, and even individual differences. In order to better understand how temporal integration of audiovisual events occurs in the real world, we need to close the gap between the experimental setting and the complex setting of everyday life. With this work, we aimed to contribute one brick to the bridge that will close this gap. We compared perceived synchrony for long-running and eventful audiovisual sequences to shorter sequences that contain a single audiovisual event, for three types of content: action, music, and speech. The resulting windows of temporal integration showed that participants were better at detecting asynchrony for the longer stimuli, possibly because the long-running sequences contain multiple corresponding events that offer audiovisual timing cues. Moreover, the points of subjective simultaneity differ between content types, suggesting that the nature of a visual scene could influence the temporal perception of events. An expected outcome from this type of experiment was the rich variation among participants' distributions and the derived points of subjective simultaneity. Hence, the designs of similar experiments call for more participants than traditional psychophysical studies. Heeding this caution, we conclude that existing theories on multisensory perception are ready to be tested on more natural and representative stimuli.
Calcus, Axelle; Lorenzi, Christian; Collet, Gregory; Colin, Cécile; Kolinsky, Régine
Purpose: Children with dyslexia have been suggested to experience deficits in both categorical perception (CP) and speech identification in noise (SIN) perception. However, results regarding both abilities are inconsistent, and the relationship between them is still unclear. Therefore, this study aimed to investigate the relationship between CP…
Judge, Simon; Robertson, Zoë; Hawley, Mark; Enderby, Pam
To explore users' experiences and perceptions of speech-driven environmental control systems (SPECS) as part of a larger project aiming to develop a new SPECS. The motivation for this part of the project was to add to the evidence base for the use of SPECS and to determine the key design specifications for a new speech-driven system from a user's perspective. Semi-structured interviews were conducted with 12 users of SPECS from around the United Kingdom. These interviews were transcribed and analysed using a qualitative method based on framework analysis. Reliability is the main influence on the use of SPECS. All the participants gave examples of occasions when their speech-driven system was unreliable; in some instances, this unreliability was reported as not being a problem (e.g., for changing television channels); however, it was perceived as a problem for more safety critical functions (e.g., opening a door). Reliability was cited by participants as the reason for using a switch-operated system as back up. Benefits of speech-driven systems focused on speech operation enabling access when other methods were not possible; quicker operation and better aesthetic considerations. Overall, there was a perception of increased independence from the use of speech-driven environmental control. In general, speech was considered a useful method of operating environmental controls by the participants interviewed; however, their perceptions regarding reliability often influenced their decision to have backup or alternative systems for certain functions.
Calandruccio, Lauren; Gomez, Bianca; Buss, Emily; Leibold, Lori J
The purpose of this study was to develop a task to evaluate children's English and Spanish speech perception abilities in either noise or competing speech maskers. Eight bilingual Spanish-English and 8 age-matched monolingual English children (ages 4.9-16.4 years) were tested. A forced-choice, picture-pointing paradigm was selected for adaptively estimating masked speech reception thresholds. Speech stimuli were spoken by simultaneous bilingual Spanish-English talkers. The target stimuli were 30 disyllabic English and Spanish words, familiar to 5-year-olds and easily illustrated. Competing stimuli included either 2-talker English or 2-talker Spanish speech (corresponding to target language) and spectrally matched noise. For both groups of children, regardless of test language, performance was significantly worse for the 2-talker than for the noise masker condition. No difference in performance was found between bilingual and monolingual children. Bilingual children performed significantly better in English than in Spanish in competing speech. For all listening conditions, performance improved with increasing age. Results indicated that the stimuli and task were appropriate for speech recognition testing in both languages, providing a more conventional measure of speech-in-noise perception as well as a measure of complex listening. Further research is needed to determine performance for Spanish-dominant listeners and to evaluate the feasibility of implementation into routine clinical use.
Drawing on recent developments in the cortical organization of vision, and on data from a variety of sources, Hickok and Poeppel (2000) have proposed a new model of the functional anatomy of speech perception. The model posits that early cortical stages of speech perception involve auditory fields in the superior temporal gyrus bilaterally (although asymmetrically). This cortical processing system then diverges into two broad processing streams, a ventral stream, involved in mapping sound onto meaning, and a dorsal stream, involved in mapping sound onto articulatory-based representations. The ventral stream projects ventrolaterally toward inferior posterior temporal cortex which serves as an interface between sound and meaning. The dorsal stream projects dorsoposteriorly toward the parietal lobe and ultimately to frontal regions. This network provides a mechanism for the development and maintenance of ``parity'' between auditory and motor representations of speech. Although the dorsal stream represents a tight connection between speech perception and speech production, it is not a critical component of the speech perception process under ecologically natural listening conditions. Some degree of bi-directionality in both the dorsal and ventral pathways is also proposed. A variety of recent empirical tests of this model have provided further support for the proposal.
Full Text Available For deaf individuals with residual low-frequency acoustic hearing, combined use of a cochlear implant (CI and hearing aid (HA typically provides better speech understanding than with either device alone. Because of coarse spectral resolution, CIs do not provide fundamental frequency (F0 information that contributes to understanding of tonal languages such as Mandarin Chinese. The HA can provide good representation of F0 and, depending on the range of aided acoustic hearing, first and second formant (F1 and F2 information. In this study, Mandarin tone, vowel, and consonant recognition in quiet and noise was measured in 12 adult Mandarin-speaking bimodal listeners with the CI-only and with the CI+HA. Tone recognition was significantly better with the CI+HA in noise, but not in quiet. Vowel recognition was significantly better with the CI+HA in quiet, but not in noise. There was no significant difference in consonant recognition between the CI-only and the CI+HA in quiet or in noise. There was a wide range in bimodal benefit, with improvements often greater than 20 percentage points in some tests and conditions. The bimodal benefit was compared to CI subjects' HA-aided pure-tone average (PTA thresholds between 250 and 2000 Hz; subjects were divided into two groups: "better" PTA (50 dB HL. The bimodal benefit differed significantly between groups only for consonant recognition. The bimodal benefit for tone recognition in quiet was significantly correlated with CI experience, suggesting that bimodal CI users learn to better combine low-frequency spectro-temporal information from acoustic hearing with temporal envelope information from electric hearing. Given the small number of subjects in this study (n = 12, further research with Chinese bimodal listeners may provide more information regarding the contribution of acoustic and electric hearing to tonal language perception.
The general topic addressed by this dissertation is that of bilingualism, and more specifically, the topic of bilingual acquisition of speech sounds. The central question in this study is the following: does bilingualism affect children’s perceptual development of speech sounds? The term bilingual
Koller, Roger; Guignard, Jérémie; Caversaccio, Marco; Kompis, Martin; Senn, Pascal
Background Telecommunication is limited or even impossible for more than one-thirds of all cochlear implant (CI) users. Objective We sought therefore to study the impact of voice quality on speech perception with voice over Internet protocol (VoIP) under real and adverse network conditions. Methods Telephone speech perception was assessed in 19 CI users (15-69 years, average 42 years), using the German HSM (Hochmair-Schulz-Moser) sentence test comparing Skype and conventional telephone (public switched telephone networks, PSTN) transmission using a personal computer (PC) and a digital enhanced cordless telecommunications (DECT) telephone dual device. Five different Internet transmission quality modes and four accessories (PC speakers, headphones, 3.5 mm jack audio cable, and induction loop) were compared. As a secondary outcome, the subjective perceived voice quality was assessed using the mean opinion score (MOS). Results Speech telephone perception was significantly better (median 91.6%, P 15%) were not superior to conventional telephony. In addition, there were no significant differences between the tested accessories (P>.05) using a PC. Coupling a Skype DECT phone device with an audio cable to the CI, however, resulted in higher speech perception (median 65%) and subjective MOS scores (3.2) than using PSTN (median 7.5%, P<.001). Conclusions Skype calls significantly improve speech perception for CI users compared with conventional telephony under real network conditions. Listening accessories do not further improve listening experience. Current Skype DECT telephone devices do not fully offer technical advantages in voice quality. PMID:28438727
Full Text Available Adult speech perception reflects the long-term regularities of the native language, but it is also flexible such that it accommodates and adapts to adverse listening conditions and short-term deviations from native-language norms. The purpose of this review article is to examine how the broader neuroscience literature can inform and advance research efforts in understanding the neural basis of flexibility and adaptive plasticity in speech perception. In particular, we consider several domains of neuroscience research that offer insight into how perception can be adaptively tuned to short-term deviations while also maintaining without affecting the long-term learned regularities for mapping sensory input. We review several literatures to highlight the potential role of learning algorithms that rely on prediction error signals and discuss specific neural structures that are likely to contribute to such learning. Already, a few studies have alluded to a potential role of these mechanisms in adaptive plasticity in speech perception. Better understanding the application and limitations of these algorithms for the challenges of flexible speech perception under adverse conditions promises to inform theoretical models of speech.
Archila-Meléndez, Mario E.; Valente, Giancarlo; Correia, Joao M.; Rouhl, Rob P. W.; van Kranen-Mastenbroek, Vivianne H.; Jansma, Bernadette M.
Sensorimotor integration, the translation between acoustic signals and motoric programs, may constitute a crucial mechanism for speech. During speech perception, the acoustic-motoric translations include the recruitment of cortical areas for the representation of speech articulatory features, such
Full Text Available Synesthesia entails a special kind of sensory perception, where stimulation in one sensory modality leads to an internally generated perceptual experience of another, not stimulated sensory modality. This phenomenon can be viewed as an abnormal multisensory integration process as here the synesthetic percept is aberrantly fused with the stimulated modality. Indeed, recent synesthesia research has focused on multimodal processing even outside of the specific synesthesia-inducing context and has revealed changed multimodal integration, thus suggesting perceptual alterations at a global level. Here, we focused on audio-visual processing in synesthesia using a semantic classification task in combination with visually or auditory-visually presented animated and inanimated objects in an audio-visual congruent and incongruent manner. Fourteen subjects with auditory-visual and/or grapheme-color synesthesia and 14 control subjects participated in the experiment. During presentation of the stimuli, event-related potentials were recorded from 32 electrodes. The analysis of reaction times and error rates revealed no group differences with best performance for audio-visually congruent stimulation indicating the well-known multimodal facilitation effect. We found an enhanced amplitude of the N1 component over occipital electrode sites for synesthetes compared to controls. The differences occurred irrespective of the experimental condition and therefore suggest a global influence on early sensory processing in synesthetes.
Full Text Available It has traditionally been assumed that cochlear implant users de facto perform atypically in audiovisual tasks. However, a recent study that combined an auditory task with visual distractors suggests that only those cochlear implant users that are not proficient at recognizing speech sounds might show abnormal audiovisual interactions. The present study aims at reinforcing this notion by investigating the audiovisual segregation abilities of cochlear implant users in a visual task with auditory distractors. Speechreading was assessed in two groups of cochlear implant users (proficient and non-proficient at sound recognition, as well as in normal controls. A visual speech recognition task (i.e. speechreading was administered either in silence or in combination with three types of auditory distractors: i noise ii reverse speech sound and iii non-altered speech sound. Cochlear implant users proficient at speech recognition performed like normal controls in all conditions, whereas non-proficient users showed significantly different audiovisual segregation patterns in both speech conditions. These results confirm that normal-like audiovisual segregation is possible in highly skilled cochlear implant users and, consequently, that proficient and non-proficient CI users cannot be lumped into a single group. This important feature must be taken into account in further studies of audiovisual interactions in cochlear implant users.
Full Text Available Speech perception is thought to be linked to speech motor production. This linkage is considered to mediate multimodal aspects of speech perception, such as audio-visual and audio-tactile integration. However, direct coupling between articulatory movement and auditory perception has been little studied. The present study reveals a clear dissociation between the effects of a listener's own speech action and the effects of viewing another's speech movements on the perception of auditory phonemes. We assessed the intelligibility of the syllables [pa], [ta], and [ka] when listeners silently and simultaneously articulated syllables that were congruent/incongruent with the syllables they heard. The intelligibility was compared with a condition where the listeners simultaneously watched another's mouth producing congruent/incongruent syllables, but did not articulate. The intelligibility of [ta] and [ka] were degraded by articulating [ka] and [ta] respectively, which are associated with the same primary articulator (tongue as the heard syllables. But they were not affected by articulating [pa], which is associated with a different primary articulator (lips from the heard syllables. In contrast, the intelligibility of [ta] and [ka] was degraded by watching the production of [pa]. These results indicate that the articulatory-induced distortion of speech perception occurs in an articulator-specific manner while visually induced distortion does not. The articulator-specific nature of the auditory-motor interaction in speech perception suggests that speech motor processing directly contributes to our ability to hear speech.
Cheng, Xiaoting; Liu, Yangwenyi; Shu, Yilai; Tao, Duo-Duo; Wang, Bing; Yuan, Yasheng; Galvin, John J; Fu, Qian-Jie; Chen, Bing
Due to limited spectral resolution, cochlear implants (CIs) do not convey pitch information very well. Pitch cues are important for perception of music and tonal language; it is possible that music training may improve performance in both listening tasks. In this study, we investigated music training outcomes in terms of perception of music, lexical tones, and sentences in 22 young (4.8 to 9.3 years old), prelingually deaf Mandarin-speaking CI users. Music perception was measured using a melodic contour identification (MCI) task. Speech perception was measured for lexical tones and sentences presented in quiet. Subjects received 8 weeks of MCI training using pitch ranges not used for testing. Music and speech perception were measured at 2, 4, and 8 weeks after training was begun; follow-up measures were made 4 weeks after training was stopped. Mean baseline performance was 33.2%, 76.9%, and 45.8% correct for MCI, lexical tone recognition, and sentence recognition, respectively. After 8 weeks of MCI training, mean performance significantly improved by 22.9, 14.4, and 14.5 percentage points for MCI, lexical tone recognition, and sentence recognition, respectively ( p music and speech performance. The results suggest that music training can significantly improve pediatric Mandarin-speaking CI users' music and speech perception.
Mitchel, Aaron D; Gerfen, Chip; Weiss, Daniel J
One challenge for speech perception is between-speaker variability in the acoustic parameters of speech. For example, the same phoneme (e.g. the vowel in "cat") may have substantially different acoustic properties when produced by two different speakers and yet the listener must be able to interpret these disparate stimuli as equivalent. Perceptual tuning, the use of contextual information to adjust phonemic representations, may be one mechanism that helps listeners overcome obstacles they face due to this variability during speech perception. Here we test whether visual contextual cues to speaker identity may facilitate the formation and maintenance of distributional representations for individual speakers, allowing listeners to adjust phoneme boundaries in a speaker-specific manner. We familiarized participants to an audiovisual continuum between /aba/ and /ada/. During familiarization, the "b-face" mouthed /aba/ when an ambiguous token was played, while the "D-face" mouthed /ada/. At test, the same ambiguous token was more likely to be identified as /aba/ when paired with a stilled image of the "b-face" than with an image of the "D-face." This was not the case in the control condition when the two faces were paired equally with the ambiguous token. Together, these results suggest that listeners may form speaker-specific phonemic representations using facial identity cues.
Tryfon, Ana; Foster, Nicholas E V; Sharda, Megha; Hyde, Krista L
Autism spectrum disorder (ASD) is often characterized by atypical language profiles and auditory and speech processing. These can contribute to aberrant language and social communication skills in ASD. The study of the neural basis of speech perception in ASD can serve as a potential neurobiological marker of ASD early on, but mixed results across studies renders it difficult to find a reliable neural characterization of speech processing in ASD. To this aim, the present study examined the functional neural basis of speech perception in ASD versus typical development (TD) using an activation likelihood estimation (ALE) meta-analysis of 18 qualifying studies. The present study included separate analyses for TD and ASD, which allowed us to examine patterns of within-group brain activation as well as both common and distinct patterns of brain activation across the ASD and TD groups. Overall, ASD and TD showed mostly common brain activation of speech processing in bilateral superior temporal gyrus (STG) and left inferior frontal gyrus (IFG). However, the results revealed trends for some distinct activation in the TD group showing additional activation in higher-order brain areas including left superior frontal gyrus (SFG), left medial frontal gyrus (MFG), and right IFG. These results provide a more reliable neural characterization of speech processing in ASD relative to previous single neuroimaging studies and motivate future work to investigate how these brain signatures relate to behavioral measures of speech processing in ASD. Copyright © 2017 Elsevier B.V. All rights reserved.
Konijn, Elly A.; Walma van der Molen, Juliette H.; van Nes, Sander
This study investigated whether emotions induced in TV-viewers (either as an emotional state or co-occurring with emotional involvement) would increase viewers' perception of realism in a fake documentary and affect the information value that viewers would attribute to its content. To that end, two experiments were conducted that manipulated (a)…
Magalhães, Ana Tereza de Matos; Goffi-Gomez, Maria Valéria Schmidt; Hoshino, Ana Cristina; Tsuji, Robinson Koji; Bento, Ricardo Ferreira; Brito, Rubens
New technology in the Freedom® speech processor for cochlear implants was developed to improve how incoming acoustic sound is processed; this applies not only for new users, but also for previous generations of cochlear implants. To identify the contribution of this technology-- the Nucleus 22®--on speech perception tests in silence and in noise, and on audiometric thresholds. A cross-sectional cohort study was undertaken. Seventeen patients were selected. The last map based on the Spectra® was revised and optimized before starting the tests. Troubleshooting was used to identify malfunction. To identify the contribution of the Freedom® technology for the Nucleus22®, auditory thresholds and speech perception tests were performed in free field in sound-proof booths. Recorded monosyllables and sentences in silence and in noise (SNR = 0dB) were presented at 60 dBSPL. The nonparametric Wilcoxon test for paired data was used to compare groups. Freedom® applied for the Nucleus22® showed a statistically significant difference in all speech perception tests and audiometric thresholds. The Freedom® technology improved the performance of speech perception and audiometric thresholds of patients with Nucleus 22®.
Dole, Marjorie; Hoen, Michel; Meunier, Fanny
Developmental dyslexia is associated with impaired speech-in-noise perception. The goal of the present research was to further characterize this deficit in dyslexic adults. In order to specify the mechanisms and processing strategies used by adults with dyslexia during speech-in-noise perception, we explored the influence of background type,…
Rogalsky, Corianne; Love, Tracy; Driscoll, David; Anderson, Steven W.; Hickok, Gregory
The discovery of mirror neurons in macaque has led to a resurrection of motor theories of speech perception. Although the majority of lesion and functional imaging studies have associated perception with the temporal lobes, it has also been proposed that the ‘human mirror system’, which prominently includes Broca’s area, is the neurophysiological substrate of speech perception. Although numerous studies have demonstrated a tight link between sensory and motor speech processes, few have directly assessed the critical prediction of mirror neuron theories of speech perception, namely that damage to the human mirror system should cause severe deficits in speech perception. The present study measured speech perception abilities of patients with lesions involving motor regions in the left posterior frontal lobe and/or inferior parietal lobule (i.e., the proposed human ‘mirror system’). Performance was at or near ceiling in patients with fronto-parietal lesions. It is only when the lesion encroaches on auditory regions in the temporal lobe that perceptual deficits are evident. This suggests that ‘mirror system’ damage does not disrupt speech perception, but rather that auditory systems are the primary substrate for speech perception. PMID:21207313
Snellings, Patrick; van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age…
Dubach, Patrick; Pfiffner, Flurin; Kompis, Martin; Caversaccio, Marco
Background Telephone communication is a challenge for many hearing-impaired individuals. One important technical reason for this difficulty is the restricted frequency range (0.3–3.4 kHz) of conventional landline telephones. Internet telephony (voice over Internet protocol [VoIP]) is transmitted with a larger frequency range (0.1–8 kHz) and therefore includes more frequencies relevant to speech perception. According to a recently published, laboratory-based study, the theoretical advantage of ideal VoIP conditions over conventional telephone quality has translated into improved speech perception by hearing-impaired individuals. However, the speech perception benefits of nonideal VoIP network conditions, which may occur in daily life, have not been explored. VoIP use cannot be recommended to hearing-impaired individuals before its potential under more realistic conditions has been examined. Objective To compare realistic VoIP network conditions, under which digital data packets may be lost, with ideal conventional telephone quality with respect to their impact on speech perception by hearing-impaired individuals. Methods We assessed speech perception using standardized test material presented under simulated VoIP conditions with increasing digital data packet loss (from 0% to 20%) and compared with simulated ideal conventional telephone quality. We monaurally tested 10 adult users of cochlear implants, 10 adult users of hearing aids, and 10 normal-hearing adults in the free sound field, both in quiet and with background noise. Results Across all participant groups, mean speech perception scores using VoIP with 0%, 5%, and 10% packet loss were 15.2% (range 0%–53%), 10.6% (4%–46%), and 8.8% (7%–33%) higher, respectively, than with ideal conventional telephone quality. Speech perception did not differ between VoIP with 20% packet loss and conventional telephone quality. The maximum benefits were observed under ideal VoIP conditions without packet loss and
Mantokoudis, Georgios; Dubach, Patrick; Pfiffner, Flurin; Kompis, Martin; Caversaccio, Marco; Senn, Pascal
Telephone communication is a challenge for many hearing-impaired individuals. One important technical reason for this difficulty is the restricted frequency range (0.3-3.4 kHz) of conventional landline telephones. Internet telephony (voice over Internet protocol [VoIP]) is transmitted with a larger frequency range (0.1-8 kHz) and therefore includes more frequencies relevant to speech perception. According to a recently published, laboratory-based study, the theoretical advantage of ideal VoIP conditions over conventional telephone quality has translated into improved speech perception by hearing-impaired individuals. However, the speech perception benefits of nonideal VoIP network conditions, which may occur in daily life, have not been explored. VoIP use cannot be recommended to hearing-impaired individuals before its potential under more realistic conditions has been examined. To compare realistic VoIP network conditions, under which digital data packets may be lost, with ideal conventional telephone quality with respect to their impact on speech perception by hearing-impaired individuals. We assessed speech perception using standardized test material presented under simulated VoIP conditions with increasing digital data packet loss (from 0% to 20%) and compared with simulated ideal conventional telephone quality. We monaurally tested 10 adult users of cochlear implants, 10 adult users of hearing aids, and 10 normal-hearing adults in the free sound field, both in quiet and with background noise. Across all participant groups, mean speech perception scores using VoIP with 0%, 5%, and 10% packet loss were 15.2% (range 0%-53%), 10.6% (4%-46%), and 8.8% (7%-33%) higher, respectively, than with ideal conventional telephone quality. Speech perception did not differ between VoIP with 20% packet loss and conventional telephone quality. The maximum benefits were observed under ideal VoIP conditions without packet loss and were 36% (P = .001) for cochlear implant users, 18
Millman, Rebecca E.; Mattys, Sven L.
Purpose: Background noise can interfere with our ability to understand speech. Working memory capacity (WMC) has been shown to contribute to the perception of speech in modulated noise maskers. WMC has been assessed with a variety of auditory and visual tests, often pertaining to different components of working memory. This study assessed the relationship between speech perception in modulated maskers and components of auditory verbal working memory (AVWM) over a range of signal-to-noise rati...
McNeel Gordon Jantzen
Full Text Available Musicians have a more accurate temporal and tonal representation of auditory stimuli than their non-musician counterparts (Kraus & Chandrasekaran, 2010; Parbery-Clark, Skoe, & Kraus, 2009; Zendel & Alain, 2008; Musacchia, Sams, Skoe, & Kraus, 2007. Musicians who are adept at the production and perception of music are also more sensitive to key acoustic features of speech such as voice onset timing and pitch. Together, these data suggest that musical training may enhance the processing of acoustic information for speech sounds. In the current study, we sought to provide neural evidence that musicians process speech and music in a similar way. We hypothesized that for musicians, right hemisphere areas traditionally associated with music are also engaged for the processing of speech sounds. In contrast we predicted that in non-musicians processing of speech sounds would be localized to traditional left hemisphere language areas. Speech stimuli differing in voice onset time was presented using a dichotic listening paradigm. Subjects either indicated aural location for a specified speech sound or identified a specific speech sound from a directed aural location. Musical training effects and organization of acoustic features were reflected by activity in source generators of the P50. This included greater activation of right middle temporal gyrus (MTG and superior temporal gyrus (STG in musicians. The findings demonstrate recruitment of right hemisphere in musicians for discriminating speech sounds and a putative broadening of their language network. Musicians appear to have an increased sensitivity to acoustic features and enhanced selective attention to temporal features of speech that is facilitated by musical training and supported, in part, by right hemisphere homologues of established speech processing regions of the brain.
Srinivasan, Arthi G; Padilla, Monica; Shannon, Robert V; Landsberger, David M
Cochlear implant (CI) users typically have excellent speech recognition in quiet but struggle with understanding speech in noise. It is thought that broad current spread from stimulating electrodes causes adjacent electrodes to activate overlapping populations of neurons which results in interactions across adjacent channels. Current focusing has been studied as a way to reduce spread of excitation, and therefore, reduce channel interactions. In particular, partial tripolar stimulation has been shown to reduce spread of excitation relative to monopolar stimulation. However, the crucial question is whether this benefit translates to improvements in speech perception. In this study, we compared speech perception in noise with experimental monopolar and partial tripolar speech processing strategies. The two strategies were matched in terms of number of active electrodes, microphone, filterbanks, stimulation rate and loudness (although both strategies used a lower stimulation rate than typical clinical strategies). The results of this study showed a significant improvement in speech perception in noise with partial tripolar stimulation. All subjects benefited from the current focused speech processing strategy. There was a mean improvement in speech recognition threshold of 2.7 dB in a digits in noise task and a mean improvement of 3 dB in a sentences in noise task with partial tripolar stimulation relative to monopolar stimulation. Although the experimental monopolar strategy was worse than the clinical, presumably due to different microphones, frequency allocations and stimulation rates, the experimental partial-tripolar strategy, which had the same changes, showed no acute deficit relative to the clinical. Copyright © 2013 Elsevier B.V. All rights reserved.
Chen, Liang; Lei, Jianghua
The present study aimed to investigate the development of visual speech perception in Chinese-speaking children. Children aged 7, 13 and 16 were asked to visually identify both consonant and vowel sounds in Chinese as quickly and accurately as possible. Results revealed (1) an increase in accuracy of visual speech perception between ages 7 and 13 after which the accuracy rate either stagnates or drops; and (2) a U-shaped development pattern in speed of perception with peak performance in 13-year olds. Results also showed that across all age groups, the overall levels of accuracy rose, whereas the response times fell for simplex finals, complex finals and initials. These findings suggest that (1) visual speech perception in Chinese is a developmental process that is acquired over time and is still fine-tuned well into late adolescence; (2) factors other than cross-linguistic differences in phonological complexity and degrees of reliance on visual information are involved in development of visual speech perception.
Full Text Available Language and music share many rhythmic properties, such as variations in intensity and duration leading to repeating patterns. Perception of rhythmic properties may rely on cognitive networks that are shared between the two domains. If so, then variability in speech rhythm perception may relate to individual differences in musicality. To examine this possibility, the present study focuses on rhythmic grouping, which is assumed to be guided by a domain-general principle, the Iambic/Trochaic law, stating that sounds alternating in intensity are grouped as strong-weak, and sounds alternating in duration are grouped as weak-strong. German listeners completed a grouping task: They heard streams of syllables alternating in intensity, duration, or neither, and had to indicate whether they perceived a strong-weak or weak-strong pattern. Moreover, their music perception abilities were measured, and they filled out a questionnaire reporting their productive musical experience. Results showed that better musical rhythm perception ability was associated with more consistent rhythmic grouping of speech, while melody perception ability and productive musical experience were not. This suggests shared cognitive procedures in the perception of rhythm in music and speech. Also, the results highlight the relevance of considering individual differences in musicality when aiming to explain variability in prosody perception.
Full Text Available The well-established left hemisphere specialisation for language processing has long been claimed to be based on a low-level auditory specialization for specific acoustic features in speech, particularly regarding 'rapid temporal processing'.A novel analysis/synthesis technique was used to construct a variety of sounds based on simple sentences which could be manipulated in spectro-temporal complexity, and whether they were intelligible or not. All sounds consisted of two noise-excited spectral prominences (based on the lower two formants in the original speech which could be static or varying in frequency and/or amplitude independently. Dynamically varying both acoustic features based on the same sentence led to intelligible speech but when either or both acoustic features were static, the stimuli were not intelligible. Using the frequency dynamics from one sentence with the amplitude dynamics of another led to unintelligible sounds of comparable spectro-temporal complexity to the intelligible ones. Positron emission tomography (PET was used to compare which brain regions were active when participants listened to the different sounds.Neural activity to spectral and amplitude modulations sufficient to support speech intelligibility (without actually being intelligible was seen bilaterally, with a right temporal lobe dominance. A left dominant response was seen only to intelligible sounds. It thus appears that the left hemisphere specialisation for speech is based on the linguistic properties of utterances, not on particular acoustic features.
Full Text Available Previous studies have shown that a similar straightforward relationship exists between lightness and pitch in synaesthetes and nonsynaesthetes (Mark, 1974; Hubbard, 1996; Ward et al., 2006. These results indicated that nonsynaesthetes have similar lightness-pitch mapping with synaesthetes. However, in their experimental paradigm such a similarity seems to stem from not so much sensory phenomena as high-level memory associations in nonsynaesthetes. So, we examined what process in perception and/or cognition relates lightness-pitch synaesthesia-like phenomena in nonsynaethetes. In our study we targeted perceived colors (luminance per se rather than the imagery of color (luminance via memory association. Results indicated that the performance in color selection were affected by task-irrelevant stimuli (auditory information, but there was little significant correlation between color and auditory stimuli (pitch in simple color conditions. However, in subjective figures conditions, results showed a different tendency and partly showed correlations. Recent work indicates synaesthesia needs selective attention (Rich et al., 2010 and some research shows perception of subjective contours need attention (Gurnsey, 1996. We conjecture that lightness-pitch synaesthesia-like phenomena may need some kind of attention or the other higher brain activity (eg, memory association, cognition.
Fuller, Christina D; Galvin, John J; Maat, Bert; Başkent, Deniz; Free, Rolien H
In normal-hearing (NH) adults, long-term music training may benefit music and speech perception, even when listening to spectro-temporally degraded signals as experienced by cochlear implant (CI) users. In this study, we compared two different music training approaches in CI users and their effects on speech and music perception, as it remains unclear which approach to music training might be best. The approaches differed in terms of music exercises and social interaction. For the pitch/timbre group, melodic contour identification (MCI) training was performed using computer software. For the music therapy group, training involved face-to-face group exercises (rhythm perception, musical speech perception, music perception, singing, vocal emotion identification, and music improvisation). For the control group, training involved group nonmusic activities (e.g., writing, cooking, and woodworking). Training consisted of weekly 2-hr sessions over a 6-week period. Speech intelligibility in quiet and noise, vocal emotion identification, MCI, and quality of life (QoL) were measured before and after training. The different training approaches appeared to offer different benefits for music and speech perception. Training effects were observed within-domain (better MCI performance for the pitch/timbre group), with little cross-domain transfer of music training (emotion identification significantly improved for the music therapy group). While training had no significant effect on QoL, the music therapy group reported better perceptual skills across training sessions. These results suggest that more extensive and intensive training approaches that combine pitch training with the social aspects of music therapy may further benefit CI users.
Jepsen, Morten Løve
in a diagnostic rhyme test. The framework was constructed such that discrimination errors originating from the front-end and the back-end were separated. The front-end was fitted to individual listeners with cochlear hearing loss according to non-speech data, and speech data were obtained in the same listeners......A better understanding of how the human auditory system represents and analyzes sounds and how hearing impairment affects such processing is of great interest for researchers in the fields of auditory neuroscience, audiology, and speech communication as well as for applications in hearing......-instrument and speech technology. In this thesis, the primary focus was on the development and evaluation of a computational model of human auditory signal-processing and perception. The model was initially designed to simulate the normal-hearing auditory system with particular focus on the nonlinear processing...
Davis, Matthew H.; Coleman, Martin R.; Absalom, Anthony R.; Rodd, Jennifer M.; Johnsrude, Ingrid S.; Matta, Basil F.; Owen, Adrian M.; Menon, David K.
We used functional MRI and the anesthetic agent propofol to assess the relationship among neural responses to speech, successful comprehension, and conscious awareness. Volunteers were scanned while listening to sentences containing ambiguous words, matched sentences without ambiguous words, and signal-correlated noise (SCN). During three scanning sessions, participants were nonsedated (awake), lightly sedated (a slowed response to conversation), and deeply sedated (no conversational response, rousable by loud command). Bilateral temporal-lobe responses for sentences compared with signal-correlated noise were observed at all three levels of sedation, although prefrontal and premotor responses to speech were absent at the deepest level of sedation. Additional inferior frontal and posterior temporal responses to ambiguous sentences provide a neural correlate of semantic processes critical for comprehending sentences containing ambiguous words. However, this additional response was absent during light sedation, suggesting a marked impairment of sentence comprehension. A significant decline in postscan recognition memory for sentences also suggests that sedation impaired encoding of sentences into memory, with left inferior frontal and temporal lobe responses during light sedation predicting subsequent recognition memory. These findings suggest a graded degradation of cognitive function in response to sedation such that “higher-level” semantic and mnemonic processes can be impaired at relatively low levels of sedation, whereas perceptual processing of speech remains resilient even during deep sedation. These results have important implications for understanding the relationship between speech comprehension and awareness in the healthy brain in patients receiving sedation and in patients with disorders of consciousness. PMID:17938125
Sadakata, M.; Zanden, L.D.T. van der; Sekiyama, K.
The current study reports specific cases in which a positive transfer of perceptual ability from the music domain to the language domain occurs. We tested whether musical training enhances discrimination and identification performance of L2 speech sounds (timing features, nasal consonants and
Fowler, Jennifer R.; Eggleston, Jessica L.; Reavis, Kelly M.; McMillan, Garnett P.; Reiss, Lina A. J.
Purpose: The objective was to determine whether speech perception could be improved for bimodal listeners (those using a cochlear implant [CI] in one ear and hearing aid in the contralateral ear) by removing low-frequency information provided by the CI, thereby reducing acoustic-electric overlap. Method: Subjects were adult CI subjects with at…
Rotteveel, L.J.C.; Snik, A.F.M.; Cooper, H.; Mawman, D.J.; Olphen, A.F. van; Mylanus, E.A.M.
OBJECTIVES: To analyse the speech perception performance of 53 cochlear implant recipients with otosclerosis and to evaluate which factors influenced patient performance in this group. The factors included disease-related data such as demographics, pre-operative audiological characteristics, the
Beck, Ann R.; Dennis, Marcia
Speech-language pathologists (N=21) and teachers (N=54) were surveyed regarding their perceptions of classroom-based interventions. The two groups agreed about the primary advantages and disadvantages of most interventions, the primary areas of difference being classroom management and ease of data collection. Other findings indicated few…
Fuller, Christina D; Galvin, John J; Maat, Bert; Başkent, Deniz; Free, Rolien H
In normal-hearing (NH) adults, long-term music training may benefit music and speech perception, even when listening to spectro-temporally degraded signals as experienced by cochlear implant (CI) users. In this study, we compared two different music training approaches in CI users and their effects
Kleinschmidt, Dave F.; Jaeger, T. Florian
Successful speech perception requires that listeners map the acoustic signal to linguistic categories. These mappings are not only probabilistic, but change depending on the situation. For example, one talker’s /p/ might be physically indistinguishable from another talker’s /b/ (cf. lack of invariance). We characterize the computational problem posed by such a subjectively non-stationary world and propose that the speech perception system overcomes this challenge by (1) recognizing previously encountered situations, (2) generalizing to other situations based on previous similar experience, and (3) adapting to novel situations. We formalize this proposal in the ideal adapter framework: (1) to (3) can be understood as inference under uncertainty about the appropriate generative model for the current talker, thereby facilitating robust speech perception despite the lack of invariance. We focus on two critical aspects of the ideal adapter. First, in situations that clearly deviate from previous experience, listeners need to adapt. We develop a distributional (belief-updating) learning model of incremental adaptation. The model provides a good fit against known and novel phonetic adaptation data, including perceptual recalibration and selective adaptation. Second, robust speech recognition requires listeners learn to represent the structured component of cross-situation variability in the speech signal. We discuss how these two aspects of the ideal adapter provide a unifying explanation for adaptation, talker-specificity, and generalization across talkers and groups of talkers (e.g., accents and dialects). The ideal adapter provides a guiding framework for future investigations into speech perception and adaptation, and more broadly language comprehension. PMID:25844873
Delphi, Maryam; Lotfi, M-Yones; Moossavi, Abdollah; Bakhshi, Enayatollah; Banimostafa, Maryam
Previous studies have shown that interaural-time-difference (ITD) training can improve localization ability. Surprisingly little is, however, known about localization training vis-à-vis speech perception in noise based on interaural time difference in the envelope (ITD ENV). We sought to investigate the reliability of an ITD ENV-based training program in speech-in-noise perception among elderly individuals with normal hearing and speech-in-noise disorder. The present interventional study was performed during 2016. Sixteen elderly men between 55 and 65 years of age with the clinical diagnosis of normal hearing up to 2000 Hz and speech-in-noise perception disorder participated in this study. The training localization program was based on changes in ITD ENV. In order to evaluate the reliability of the training program, we performed speech-in-noise tests before the training program, immediately afterward, and then at 2 months' follow-up. The reliability of the training program was analyzed using the Friedman test and the SPSS software. Significant statistical differences were shown in the mean scores of speech-in-noise perception between the 3 time points (P=0.001). The results also indicated no difference in the mean scores of speech-in-noise perception between the 2 time points of immediately after the training program and 2 months' follow-up (P=0.212). The present study showed the reliability of an ITD ENV-based localization training in elderly individuals with speech-in-noise perception disorder.
Tremblay, Pascale; Small, Steven L.
What is the nature of the interface between speech perception and production, where auditory and motor representations converge? One set of explanations suggests that during perception, the motor circuits involved in producing a perceived action are in some way enacting the action without actually causing movement (covert simulation) or sending along the motor information to be used to predict its sensory consequences (i.e., efference copy). Other accounts either reject entirely the involvement of motor representations in perception, or explain their role as being more supportive than integral, and not employing the identical circuits used in production. Using fMRI, we investigated whether there are brain regions that are conjointly active for both speech perception and production, and whether these regions are sensitive to articulatory (syllabic) complexity during both processes, which is predicted by a covert simulation account. A group of healthy young adults (1) observed a female speaker produce a set of familiar words (perception), and (2) observed and then repeated the words (production). There were two types of words, varying in articulatory complexity, as measured by the presence or absence of consonant clusters. The simple words contained no consonant cluster (e.g. “palace”), while the complex words contained one to three consonant clusters (e.g. “planet”). Results indicate that the left ventral premotor cortex (PMv) was significantly active during speech perception and speech production but that activation in this region was scaled to articulatory complexity only during speech production, revealing an incompletely specified efferent motor signal during speech perception. The right planum temporal (PT) was also active during speech perception and speech production, and activation in this region was scaled to articulatory complexity during both production and perception. These findings are discussed in the context of current theories theory of
Full Text Available Introduction:Previous studies have highlighted the advantage of audio–visual oddball tasks (instead of unimodal ones in order to electrophysiologically index subclinical behavioral differences. Since alexithymia is highly prevalent in the general population, we investigated whether the use of various bimodal tasks could elicit emotional effects in low- versus high-alexithymic scorers. Methods:Fifty students (33 females were split into groups based on low and high scores on the Toronto Alexithymia Scale. During event-related potential recordings, they were exposed to three kinds of audio–visual oddball tasks: neutral (geometrical forms and bips, animal (dog and cock with their respective shouts, or emotional (faces and voices stimuli. In each condition, participants were asked to quickly detect deviant events occurring amongst a train of frequent matching stimuli (e.g., push a button when a sad face–voice pair appeared amongst a train of neutral face–voice pairs. P100, N100, and P300 components were analyzed: P100 refers to visual perceptive processing, N100 to auditory ones, and the P300 relates to response-related stages. Results:High-alexithymic scorers presented a particular pattern of results when processing the emotional stimulations, reflected in early ERP components by increased P100 and N100 amplitudes in the emotional oddball tasks (P100: pConclusions:Our findings suggest that high-alexithymic scorers require heightened early attentional resources when confronted with emotional stimuli.
Davis, Matthew H.; Coleman, Martin R.; Absalom, Anthony R.; Rodd, Jennifer M.; Johnsrude, Ingrid S.; Matta, Basil F.; Owen, Adrian M.; Menon, David K.
We used functional MRI and the anesthetic agent propofol to assess the relationship among neural responses to speech, successful comprehension, and conscious awareness. Volunteers were scanned while listening to sentences containing ambiguous words, matched sentences without ambiguous words, and signal-correlated noise (SCN). During three scanning sessions, participants were nonsedated (awake), lightly sedated (a slowed response to conversation), and deeply sedated (no conversational response...
Frijns, Johan H M; Snel-Bongers, Jorien; Vellinga, Dirk; Schrage, Erik; Vanpoucke, Filiep J; Briaire, Jeroen J
Even with six defective contacts, spanning can largely restore speech perception with the HiRes 120 speech processing strategy to the level supported by an intact electrode array. Moreover, the sound quality is not degraded. Previous studies have demonstrated reduced speech perception scores (SPS) with defective contacts in HiRes 120. This study investigated whether replacing defective contacts by spanning, i.e. current steering on non-adjacent contacts, is able to restore speech recognition to the level supported by an intact electrode array. Ten adult cochlear implant recipients (HiRes90K, HiFocus1J) with experience with HiRes 120 participated in this study. Three different defective electrode arrays were simulated (six separate defective contacts, three pairs or two triplets). The participants received three take-home strategies and were asked to evaluate the sound quality in five predefined listening conditions. After 3 weeks, SPS were evaluated with monosyllabic words in quiet and in speech-shaped background noise. The participants rated the sound quality equal for all take-home strategies. SPS with background noise were equal for all conditions tested. However, SPS in quiet (85% phonemes correct on average with the full array) decreased significantly with increasing spanning distance, with a 3% decrease for each spanned contact.
Shafiro, Valeriy; Sheft, Stanley; Gygi, Brian; Ho, Kim Thien N
Perceptual training with spectrally degraded environmental sounds results in improved environmental sound identification, with benefits shown to extend to untrained speech perception as well. The present study extended those findings to examine longer-term training effects as well as effects of mere repeated exposure to sounds over time. Participants received two pretests (1 week apart) prior to a week-long environmental sound training regimen, which was followed by two posttest sessions, separated by another week without training. Spectrally degraded stimuli, processed with a four-channel vocoder, consisted of a 160-item environmental sound test, word and sentence tests, and a battery of basic auditory abilities and cognitive tests. Results indicated significant improvements in all speech and environmental sound scores between the initial pretest and the last posttest with performance increments following both exposure and training. For environmental sounds (the stimulus class that was trained), the magnitude of positive change that accompanied training was much greater than that due to exposure alone, with improvement for untrained sounds roughly comparable to the speech benefit from exposure. Additional tests of auditory and cognitive abilities showed that speech and environmental sound performance were differentially correlated with tests of spectral and temporal-fine-structure processing, whereas working memory and executive function were correlated with speech, but not environmental sound perception. These findings indicate generalizability of environmental sound training and provide a basis for implementing environmental sound training programs for cochlear implant (CI) patients.
Badin, P; Elisei, F; Bailly, G; Savariaux, C; Serrurier, A; Tarabalka, Y
In the framework of experimental phonetics, our approach to the study of speech production is based on the measurement, the analysis and the modeling of orofacial articulators such as the jaw, the face and the lips, the tongue or the velum. Therefore, we present in this article experimental techniques that allow characterising the shape and movement of speech articulators (static and dynamic MRI, computed tomodensitometry, electromagnetic articulography, video recording). We then describe the linear models of the various organs that we can elaborate from speaker-specific articulatory data. We show that these models, that exhibit a good geometrical resolution, can be controlled from articulatory data with a good temporal resolution and can thus permit the reconstruction of high quality animation of the articulators. These models, that we have integrated in a virtual talking head, can produce augmented audiovisual speech. In this framework, we have assessed the natural tongue reading capabilities of human subjects by means of audiovisual perception tests. We conclude by suggesting a number of other applications of talking heads.
Oryadi-Zanjani, Mohammad Majid; Vahab, Maryam; Bazrafkan, Mozhdeh; Haghjoo, Asghar
The aim of this study was to examine the role of audiovisual speech recognition as a clinical criterion of cochlear implant or hearing aid efficiency in Persian-language children with severe-to-profound hearing loss. This research was administered as a cross-sectional study. The sample size was 60 Persian 5-7 year old children. The assessment tool was one of subtests of Persian version of the Test of Language Development-Primary 3. The study included two experiments: auditory-only and audiovisual presentation conditions. The test was a closed-set including 30 words which were orally presented by a speech-language pathologist. The scores of audiovisual word perception were significantly higher than auditory-only condition in the children with normal hearing (Paudiovisual presentation conditions (P>0.05). The audiovisual spoken word recognition can be applied as a clinical criterion to assess the children with severe to profound hearing loss in order to find whether cochlear implant or hearing aid has been efficient for them or not; i.e. if a child with hearing impairment who using CI or HA can obtain higher scores in audiovisual spoken word recognition than auditory-only condition, his/her auditory skills have appropriately developed due to effective CI or HA as one of the main factors of auditory habilitation. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Fuller, Christina; Free, Rolien; Maat, Bert; Başkent, Deniz
In normal-hearing listeners, musical background has been observed to change the sound representation in the auditory system and produce enhanced performance in some speech perception tests. Based on these observations, it has been hypothesized that musical background can influence sound and speech perception, and as an extension also the quality of life, by cochlear-implant users. To test this hypothesis, this study explored musical background [using the Dutch Musical Background Questionnaire (DMBQ)], and self-perceived sound and speech perception and quality of life [using the Nijmegen Cochlear Implant Questionnaire (NCIQ) and the Speech Spatial and Qualities of Hearing Scale (SSQ)] in 98 postlingually deafened adult cochlear-implant recipients. In addition to self-perceived measures, speech perception scores (percentage of phonemes recognized in words presented in quiet) were obtained from patient records. The self-perceived hearing performance was associated with the objective speech perception. Forty-one respondents (44% of 94 respondents) indicated some form of formal musical training. Fifteen respondents (18% of 83 respondents) judged themselves as having musical training, experience, and knowledge. No association was observed between musical background (quantified by DMBQ), and self-perceived hearing-related performance or quality of life (quantified by NCIQ and SSQ), or speech perception in quiet.
Tanta, Ivan; Lesinger, Gordana
Key place in this paper takes the study of political speech in the Republic of Croatia and their impact on voters, or which keywords are in political speeches and public appearances of politicians in Croatia that their voting body wants to hear. Given listed below we will define the research topic in the form of a question - is there a discrepancy in the perception of the public-political speech in Croatia, and which keywords are specific to the two main regions in Croatia and that inhabitant these regions respond. Marcus Tullius Cicero, the most important Roman orator, he used a specific associative mnemonic technique that is called "technique room". He would talk expound on keywords and conceptual terms that he needed for the desired topic and join in these make them, according to the desired order, in a very creative and unique way, the premises of the house or palace, which he knew well. Then, while holding the speech intended to pass through rooms of the house or palace and then put keywords and concepts come to mind, again according to the desired order. Given that this is a specific kind of research political speech that is relatively recent in Croatia, it should be noted that there is still, this kind of political communication is not sufficiently explored. Particularly the emphasis on the impact and use of keywords specific to the Republic of Croatia, in everyday public and political communication. The paper will be analyzed the political, campaign speeches and promises several winning candidates, and now Croatian MEPs, specific keywords related to: economics, culture, science, education and health. The analysis is based on comparison of the survey results on the representation of key words in the speeches of politicians and qualitative analysis of the speeches of politicians on key words during the election campaign.
Wu, Che-Ming; Liu, Tien-Chen; Wang, Nan-Mai; Chao, Wei-Chieh
(1) To understand speech perception and communication ability through real telephone calls by Mandarin-speaking children with cochlear implants and compare them to live-voice perception, (2) to report the general condition of telephone use of this population, and (3) to investigate the factors that correlate with telephone speech perception performance. Fifty-six children with over 4 years of implant use (aged 6.8-13.6 years, mean duration 8.0 years) took three speech perception tests administered using telephone and live voice to examine sentence, monosyllabic-word and Mandarin tone perception. The children also filled out a questionnaire survey investigating everyday telephone use. Wilcoxon signed-rank test was used to compare the scores between live-voice and telephone tests, and Pearson's test to examine the correlation between them. The mean scores were 86.4%, 69.8% and 70.5% respectively for sentence, word and tone recognition over the telephone. The corresponding live-voice mean scores were 94.3%, 84.0% and 70.8%. Wilcoxon signed-rank test showed the sentence and word scores were significantly different between telephone and live voice test, while the tone recognition scores were not, indicating tone perception was less worsened by telephone transmission than words and sentences. Spearman's test showed that chronological age and duration of implant use were weakly correlated with the perception test scores. The questionnaire survey showed 78% of the children could initiate phone calls and 59% could use the telephone 2 years after implantation. Implanted children are potentially capable of using the telephone 2 years after implantation, and communication ability over the telephone becomes satisfactory 4 years after implantation. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
Ross, Lars A; Del Bene, Victor A; Molholm, Sophie; Jae Woo, Young; Andrade, Gizely N; Abrahams, Brett S; Foxe, John J
Three lines of evidence motivated this study. 1) CNTNAP2 variation is associated with autism risk and speech-language development. 2) CNTNAP2 variations are associated with differences in white matter (WM) tracts comprising the speech-language circuitry. 3) Children with autism show impairment in multisensory speech perception. Here, we asked whether an autism risk-associated CNTNAP2 single nucleotide polymorphism in neurotypical adults was associated with multisensory speech perception performance, and whether such a genotype-phenotype association was mediated through white matter tract integrity in speech-language circuitry. Risk genotype at rs7794745 was associated with decreased benefit from visual speech and lower fractional anisotropy (FA) in several WM tracts (right precentral gyrus, left anterior corona radiata, right retrolenticular internal capsule). These structural connectivity differences were found to mediate the effect of genotype on audiovisual speech perception, shedding light on possible pathogenic pathways in autism and biological sources of inter-individual variation in audiovisual speech processing in neurotypicals. Copyright © 2017 Elsevier Inc. All rights reserved.
in a manner that allowed the subjective audiovisual evaluation of loudspeakers under controlled conditions. Additionally, unimodal audio and visual evaluations were used as a baseline for comparison. The same procedure was applied in the investigation of the validity of less than optimal stimuli presentations...
Physiology Teacher, 1976
Lists and reviews recent audiovisual materials in areas of medical, dental, nursing and allied health, and veterinary medicine; undergraduate, and high school studies. Each is classified as to level, type of instruction, usefulness, and source of availability. Topics include respiration, renal physiology, muscle mechanics, anatomy, evolution,…
sounds, has found that both normal-hearing and hearing-impaired listeners prefer loud sounds to be closer to the most comfortable loudness-level, than suggested by common non-linear fitting rules. During this project, two listening experiments were carried out. In the first experiment, hearing aid users......Hearing aid processing of loud speech and noise signals: Consequences for loudness perception and listening comfort. Sound processing in hearing aids is determined by the fitting rule. The fitting rule describes how the hearing aid should amplify speech and sounds in the surroundings......, such that they become audible again for the hearing impaired person. The general goal is to place all sounds within the hearing aid users’ audible range, such that speech intelligibility and listening comfort become as good as possible. Amplification strategies in hearing aids are in many cases based on empirical...
Başkent, Deniz; Fuller, Christina; Galvin, John; Schepel, Like; Gaudrain, Etienne; Free, Rolien
In adult normal-hearing musicians, perception of music, vocal emotion, and speech in noise has been previously shown to be better than non-musicians, sometimes even with spectro-temporally degraded stimuli. In this study, melodic contour identification, vocal emotion identification, and speech
Millman, Rebecca E.; Mattys, Sven L.
Purpose: Background noise can interfere with our ability to understand speech. Working memory capacity (WMC) has been shown to contribute to the perception of speech in modulated noise maskers. WMC has been assessed with a variety of auditory and visual tests, often pertaining to different components of working memory. This study assessed the…
Fuller, Christina; Free, Rolien; Maat, Bert; Baskent, Deniz
In normal-hearing listeners, musical background has been observed to change the sound representation in the auditory system and produce enhanced performance in some speech perception tests. Based on these observations, it has been hypothesized that musical background can influence sound and speech
Marquezin, Daniela Maria Santos Serrano; Viola, Izabel; Ghirardi, Ana Carolina de Assis Moura; Madureira, Sandra; Ferreira, Léslie Piccolotto
To analyze speech expressiveness in a group of executives based on perceptive and acoustic aspects of vocal dynamics. Four male subjects participated in the research study (S1, S2, S3, and S4). The assessments included the Kingdomality test to obtain the keywords of communicative attitudes; perceptive-auditory assessment to characterize vocal quality and dynamics, performed by three judges who are speech language pathologists; perceptiveauditory assessment to judge the chosen keywords; speech acoustics to assess prosodic elements (Praat software); and a statistical analysis. According to the perceptive-auditory analysis of vocal dynamics, S1, S2, S3, and S4 did not show vocal alterations and all of them were considered with lowered habitual pitch. S1: pointed out as insecure, nonobjective, nonempathetic, and unconvincing with inappropriate use of pauses that are mainly formed by hesitations; inadequate separation of prosodic groups with breaking of syntagmatic constituents. S2: regular use of pauses for respiratory reload, organization of sentences, and emphasis, which is considered secure, little objective, empathetic, and convincing. S3: pointed out as secure, objective, empathetic, and convincing with regular use of pauses for respiratory reload and organization of sentences and hesitations. S4: the most secure, objective, empathetic, and convincing, with proper use of pauses for respiratory reload, planning, and emphasis; prosodic groups agreed with the statement, without separating the syntagmatic constituents. The speech characteristics and communicative attitudes were highlighted in two subjects in a different manner, in such a way that the slow rate of speech and breaks of the prosodic groups transmitted insecurity, little objectivity, and nonpersuasion.
Evans, Samuel; Davis, Matthew H
How humans extract the identity of speech sounds from highly variable acoustic signals remains unclear. Here, we use searchlight representational similarity analysis (RSA) to localize and characterize neural representations of syllables at different levels of the hierarchically organized temporo-frontal pathways for speech perception. We asked participants to listen to spoken syllables that differed considerably in their surface acoustic form by changing speaker and degrading surface acoustics using noise-vocoding and sine wave synthesis while we recorded neural responses with functional magnetic resonance imaging. We found evidence for a graded hierarchy of abstraction across the brain. At the peak of the hierarchy, neural representations in somatomotor cortex encoded syllable identity but not surface acoustic form, at the base of the hierarchy, primary auditory cortex showed the reverse. In contrast, bilateral temporal cortex exhibited an intermediate response, encoding both syllable identity and the surface acoustic form of speech. Regions of somatomotor cortex associated with encoding syllable identity in perception were also engaged when producing the same syllables in a separate session. These findings are consistent with a hierarchical account of how variable acoustic signals are transformed into abstract representations of the identity of speech sounds. © The Author 2015. Published by Oxford University Press.
Guediche, Sara; Fiez, Julie A; Holt, Lori L
When listeners encounter speech under adverse listening conditions, adaptive adjustments in perception can improve comprehension over time. In some cases, these adaptive changes require the presence of external information that disambiguates the distorted speech signals, whereas in other cases mere exposure is sufficient. Both external (e.g., written feedback) and internal (e.g., prior word knowledge) sources of information can be used to generate predictions about the correct mapping of a distorted speech signal. We hypothesize that these predictions provide a basis for determining the discrepancy between the expected and actual speech signal that can be used to guide adaptive changes in perception. This study provides the first empirical investigation that manipulates external and internal factors through (a) the availability of explicit external disambiguating information via the presence or absence of postresponse orthographic information paired with a repetition of the degraded stimulus, and (b) the accuracy of internally generated predictions; an acoustic distortion is introduced either abruptly or incrementally. The results demonstrate that the impact of external information on adaptive plasticity is contingent upon whether the intelligibility of the stimuli permits accurate internally generated predictions during exposure. External information sources enhance adaptive plasticity only when input signals are severely degraded and cannot reliably access internal predictions. This is consistent with a computational framework for adaptive plasticity in which error-driven supervised learning relies on the ability to compute sensory prediction error signals from both internal and external sources of information. (PsycINFO Database Record (c) 2016 APA, all rights reserved).
Varnet, Léo; Wang, Tianyun; Peter, Chloe; Meunier, Fanny; Hoen, Michel
It is now well established that extensive musical training percolates to higher levels of cognition, such as speech processing. However, the lack of a precise technique to investigate the specific listening strategy involved in speech comprehension has made it difficult to determine how musicians' higher performance in non-speech tasks contributes to their enhanced speech comprehension. The recently developed Auditory Classification Image approach reveals the precise time-frequency regions used by participants when performing phonemic categorizations in noise. Here we used this technique on 19 non-musicians and 19 professional musicians. We found that both groups used very similar listening strategies, but the musicians relied more heavily on the two main acoustic cues, at the first formant onset and at the onsets of the second and third formants onsets. Additionally, they responded more consistently to stimuli. These observations provide a direct visualization of auditory plasticity resulting from extensive musical training and shed light on the level of functional transfer between auditory processing and speech perception.
Roessler, Abeba, 1981-
The present dissertation set out to investigate how the linguistic environment affects speech perception. Three sets of studies have explored effects of bilingualism on word recognition in adults and infants and the impact of first language linguistic knowledge on rule learning in adults. In the present work, we have found evidence in three auditory priming studies that bilingual adults, in contrast to monolinguals have developed mechanisms to effectively overcome interference from irrelevant...
Recent models on speech perception propose a dual stream processing network, with a dorsal stream, extending from the posterior temporal lobe of the left hemisphere through inferior parietal areas into the left inferior frontal gyrus, and a ventral stream that is assumed to originate in the primary auditory cortex in the upper posterior part of the temporal lobe and to extend towards the anterior part of the temporal lobe, where it may connect to the ventral part of the inferior frontal gyrus...
Recent models on speech perception propose a dual-stream processing network, with a dorsal stream, extending from the posterior temporal lobe of the left hemisphere through inferior parietal areas into the left inferior frontal gyrus, and a ventral stream that is assumed to originate in the primary auditory cortex in the upper posterior part of the temporal lobe and to extend toward the anterior part of the temporal lobe, where it may connect to the ventral part of the inferior frontal gyrus....
Full Text Available In obstetrics postoperative cognitive dysfunctions may take place after caesarean section and vaginal delivery with poor results both for mother and child. The goal was to study influence of anesthesia techniques following caesarian section on memory, perception and speech. Having agreed with local ethics committee and obtained informed consent depending on anesthesia method, pregnant women were divided into 2 groups: 1st group (n=31 had spinal anesthesia, 2nd group (n=34 – total intravenous anesthesia. Spinal anesthesia: 1.8-2.2 mLs of hyperbaric 0.5% bupivacaine. ТIVА: Thiopental sodium (4 mgs kg-1, succinylcholine (1-1.5 mgs kg-1. Phentanyl (10-5-3 µgs kg-1 hour and Diazepam (10 mgs were used after newborn extraction. We used Luria’s test for memory assessment, perception was studied by test “recognition of time”. Speech was studied by test "name of fingers". Control points: 1 - before the surgery, 2 - in 24h after the caesarian section, 3 - on day 3 after surgery, 4 - at discharge from hospital (5-7th day. The study showed that initially decreased memory level in expectant mothers regressed along with the time after caesarean section. Memory is restored in 3 days after surgery regardless of anesthesia techniques. In spinal anesthesia on 5-7th postoperative day memory level exceeds that of used in total intravenous anesthesia. The perception and speech do not depend on the term of postoperative period. Anesthesia technique does not influence perception and speech restoration after caesarean sections.
Ten Oever, Sanne; Sack, Alexander T; Wheat, Katherine L; Bien, Nina; van Atteveldt, Nienke
Content and temporal cues have been shown to interact during audio-visual (AV) speech identification. Typically, the most reliable unimodal cue is used more strongly to identify specific speech features; however, visual cues are only used if the AV stimuli are presented within a certain temporal window of integration (TWI). This suggests that temporal cues denote whether unimodal stimuli belong together, that is, whether they should be integrated. It is not known whether temporal cues also provide information about the identity of a syllable. Since spoken syllables have naturally varying AV onset asynchronies, we hypothesize that for suboptimal AV cues presented within the TWI, information about the natural AV onset differences can aid in speech identification. To test this, we presented low-intensity auditory syllables concurrently with visual speech signals, and varied the stimulus onset asynchronies (SOA) of the AV pair, while participants were instructed to identify the auditory syllables. We revealed that specific speech features (e.g., voicing) were identified by relying primarily on one modality (e.g., auditory). Additionally, we showed a wide window in which visual information influenced auditory perception, that seemed even wider for congruent stimulus pairs. Finally, we found a specific response pattern across the SOA range for syllables that were not reliably identified by the unimodal cues, which we explained as the result of the use of natural onset differences between AV speech signals. This indicates that temporal cues not only provide information about the temporal integration of AV stimuli, but additionally convey information about the identity of AV pairs. These results provide a detailed behavioral basis for further neuro-imaging and stimulation studies to unravel the neurofunctional mechanisms of the audio-visual-temporal interplay within speech perception.
Full Text Available Event-related potential (ERP evidence demonstrates that preschool-aged children selectively attend to informative moments such as word onsets during speech perception. Although this observation indicates a role for attention in language processing, it is unclear whether this type of attention is part of basic speech perception mechanisms, higher-level language skills, or general cognitive abilities. The current study examined these possibilities by measuring ERPs from 5-year-old children listening to a narrative containing attention probes presented before, during, and after word onsets as well as at random control times. Children also completed behavioral tests assessing verbal and nonverbal skills. Probes presented after word onsets elicited a more negative ERP response beginning around 100 ms after probe onset than control probes, indicating increased attention to word-initial segments. Crucially, the magnitude of this difference was correlated with performance on verbal tasks, but showed no relationship to nonverbal measures. More specifically, ERP attention effects were most strongly correlated with performance on a complex metalinguistic task involving grammaticality judgments. These results demonstrate that effective allocation of attention during speech perception supports higher-level, controlled language processing in children by allowing them to focus on relevant information at individual word and complex sentence levels.
Keane, Brian P.; Rosenthal, Orna; Chun, Nicole H.; Shams, Ladan
Autism involves various perceptual benefits and deficits, but it is unclear if the disorder also involves anomalous audiovisual integration. To address this issue, we compared the performance of high-functioning adults with autism and matched controls on experiments investigating the audiovisual integration of speech, spatiotemporal relations, and…
Full Text Available The present study attempts to investigate Indonesian EFL teachersâ€™ and native English speakersâ€™ perceptions of mispronunciations of English sounds by Indonesian EFL learners. For this purpose, a paper-form questionnaire consisting of 32 target mispronunciations was distributed to Indonesian secondary school teachers of English and also to native English speakers. An analysis of the respondentsâ€™ perceptions has discovered that 14 out of the 32 target mispronunciations are pedagogically significant in pronunciation instruction. A further analysis of the reasons for these major mispronunciations has reconfirmed the prevalence of interference of learnersâ€™ native language in their English pronunciation as a major cause of mispronunciations. It has also revealed Indonesian EFL teachersâ€™ tendency to overestimate the seriousness of their learnersâ€™ pronunciations. Based on these findings, the study makes suggestions for better English pronunciation teaching in Indonesia or other EFL countries.
Millman, Rebecca E; Mattys, Sven L
Background noise can interfere with our ability to understand speech. Working memory capacity (WMC) has been shown to contribute to the perception of speech in modulated noise maskers. WMC has been assessed with a variety of auditory and visual tests, often pertaining to different components of working memory. This study assessed the relationship between speech perception in modulated maskers and components of auditory verbal working memory (AVWM) over a range of signal-to-noise ratios. Speech perception in noise and AVWM were measured in 30 listeners (age range 31-67 years) with normal hearing. AVWM was estimated using forward digit recall, backward digit recall, and nonword repetition. After controlling for the effects of age and average pure-tone hearing threshold, speech perception in modulated maskers was related to individual differences in the phonological component of working memory (as assessed by nonword repetition) but only in the least favorable signal-to-noise ratio. The executive component of working memory (as assessed by backward digit) was not predictive of speech perception in any conditions. AVWM is predictive of the ability to benefit from temporal dips in modulated maskers: Listeners with greater phonological WMC are better able to correctly identify sentences in modulated noise backgrounds.
Full Text Available Background: Previous studies have shown that interaural-time-difference (ITD training can improve localization ability. Surprisingly little is, however, known about localization training vis-à-vis speech perception in noise based on interaural time difference in the envelope (ITD ENV. We sought to investigate the reliability of an ITD ENV-based training program in speech-in-noise perception among elderly individuals with normal hearing and speech-in-noise disorder. Methods: The present interventional study was performed during 2016. Sixteen elderly men between 55 and 65 years of age with the clinical diagnosis of normal hearing up to 2000 Hz and speech-in-noise perception disorder participated in this study. The training localization program was based on changes in ITD ENV. In order to evaluate the reliability of the training program, we performed speech-in-noise tests before the training program, immediately afterward, and then at 2 months’ follow-up. The reliability of the training program was analyzed using the Friedman test and the SPSS software. Results: Significant statistical differences were shown in the mean scores of speech-in-noise perception between the 3 time points (P=0.001. The results also indicated no difference in the mean scores of speech-in-noise perception between the 2 time points of immediately after the training program and 2 months’ follow-up (P=0.212. Conclusion: The present study showed the reliability of an ITD ENV-based localization training in elderly individuals with speech-in-noise perception disorder.
Full Text Available The relationship between the neurobiology of speech and music has been investigated for more than a century. There remains no widespread agreement regarding how (or to what extent music perception utilizes the neural circuitry that is engaged in speech processing, particularly at the cortical level. Prominent models such as Patel’s Shared Syntactic Integration Resource Hypothesis (SSIRH and Koelsch’s neurocognitive model of music perception suggest a high degree of overlap, particularly in the frontal lobe, but also perhaps more distinct representations in the temporal lobe with hemispheric asymmetries. The present meta-analysis study used activation likelihood estimate analyses to identify the brain regions consistently activated for music as compared to speech across the functional neuroimaging (fMRI and PET literature. Eighty music and 91 speech neuroimaging studies of healthy adult control subjects were analyzed. Peak activations reported in the music and speech studies were divided into four paradigm categories: passive listening, discrimination tasks, error/anomaly detection tasks and memory-related tasks. We then compared activation likelihood estimates within each category for music versus speech, and each music condition with passive listening. We found that listening to music and to speech preferentially activate distinct temporo-parietal bilateral cortical networks. We also found music and speech to have shared resources in the left pars opercularis but speech-specific resources in the left pars triangularis. The extent to which music recruited speech-activated frontal resources was modulated by task. While there are certainly limitations to meta-analysis techniques particularly regarding sensitivity, this work suggests that the extent of shared resources between speech and music may be task-dependent and highlights the need to consider how task effects may be affecting conclusions regarding the neurobiology of speech and music.
LaCroix, Arianna N.; Diaz, Alvaro F.; Rogalsky, Corianne
The relationship between the neurobiology of speech and music has been investigated for more than a century. There remains no widespread agreement regarding how (or to what extent) music perception utilizes the neural circuitry that is engaged in speech processing, particularly at the cortical level. Prominent models such as Patel's Shared Syntactic Integration Resource Hypothesis (SSIRH) and Koelsch's neurocognitive model of music perception suggest a high degree of overlap, particularly in the frontal lobe, but also perhaps more distinct representations in the temporal lobe with hemispheric asymmetries. The present meta-analysis study used activation likelihood estimate analyses to identify the brain regions consistently activated for music as compared to speech across the functional neuroimaging (fMRI and PET) literature. Eighty music and 91 speech neuroimaging studies of healthy adult control subjects were analyzed. Peak activations reported in the music and speech studies were divided into four paradigm categories: passive listening, discrimination tasks, error/anomaly detection tasks and memory-related tasks. We then compared activation likelihood estimates within each category for music vs. speech, and each music condition with passive listening. We found that listening to music and to speech preferentially activate distinct temporo-parietal bilateral cortical networks. We also found music and speech to have shared resources in the left pars opercularis but speech-specific resources in the left pars triangularis. The extent to which music recruited speech-activated frontal resources was modulated by task. While there are certainly limitations to meta-analysis techniques particularly regarding sensitivity, this work suggests that the extent of shared resources between speech and music may be task-dependent and highlights the need to consider how task effects may be affecting conclusions regarding the neurobiology of speech and music. PMID:26321976
Scott, Sophie K.; Rosen, Stuart; Wickham, Lindsay; Wise, Richard J. S.
Positron emission tomography (PET) was used to investigate the neural basis of the comprehension of speech in unmodulated noise (``energetic'' masking, dominated by effects at the auditory periphery), and when presented with another speaker (``informational'' masking, dominated by more central effects). Each type of signal was presented at four different signal-to-noise ratios (SNRs) (+3, 0, -3, -6 dB for the speech-in-speech, +6, +3, 0, -3 dB for the speech-in-noise), with listeners instructed to listen for meaning to the target speaker. Consistent with behavioral studies, there was SNR-dependent activation associated with the comprehension of speech in noise, with no SNR-dependent activity for the comprehension of speech-in-speech (at low or negative SNRs). There was, in addition, activation in bilateral superior temporal gyri which was associated with the informational masking condition. The extent to which this activation of classical ``speech'' areas of the temporal lobes might delineate the neural basis of the informational masking is considered, as is the relationship of these findings to the interfering effects of unattended speech and sound on more explicit working memory tasks. This study is a novel demonstration of candidate neural systems involved in the perception of speech in noisy environments, and of the processing of multiple speakers in the dorso-lateral temporal lobes.
Oh, Soo Hee; Donaldson, Gail S; Kong, Ying-Yee
Low-frequency acoustic cues have been shown to enhance speech perception by cochlear-implant users, particularly when target speech occurs in a competing background. The present study examined the extent to which a continuous representation of low-frequency harmonicity cues contributes to bimodal benefit in simulated bimodal listeners. Experiment 1 examined the benefit of restoring a continuous temporal envelope to the low-frequency ear while the vocoder ear received a temporally interrupted stimulus. Experiment 2 examined the effect of providing continuous harmonicity cues in the low-frequency ear as compared to restoring a continuous temporal envelope in the vocoder ear. Findings indicate that bimodal benefit for temporally interrupted speech increases when continuity is restored to either or both ears. The primary benefit appears to stem from the continuous temporal envelope in the low-frequency region providing additional phonetic cues related to manner and F1 frequency; a secondary contribution is provided by low-frequency harmonicity cues when a continuous representation of the temporal envelope is present in the low-frequency, or both ears. The continuous temporal envelope and harmonicity cues of low-frequency speech are thought to support bimodal benefit by facilitating identification of word and syllable boundaries, and by restoring partial phonetic cues that occur during gaps in the temporally interrupted stimulus.
Gow, David W; Segawa, Jennifer A
The inherent confound between the organization of articulation and the acoustic-phonetic structure of the speech signal makes it exceptionally difficult to evaluate the competing claims of motor and acoustic-phonetic accounts of how listeners recognize coarticulated speech. Here we use Granger causation analyzes of high spatiotemporal resolution neural activation data derived from the integration of magnetic resonance imaging, magnetoencephalography and electroencephalography, to examine the role of lexical and articulatory mediation in listeners' ability to use phonetic context to compensate for place assimilation. Listeners heard two-word phrases such as pen pad and then saw two pictures, from which they had to select the one that depicted the phrase. Assimilation, lexical competitor environment and the phonological validity of assimilation context were all manipulated. Behavioral data showed an effect of context on the interpretation of assimilated segments. Analysis of 40 Hz gamma phase locking patterns identified a large distributed neural network including 16 distinct regions of interest (ROIs) spanning portions of both hemispheres in the first 200 ms of post-assimilation context. Granger analyzes of individual conditions showed differing patterns of causal interaction between ROIs during this interval, with hypothesized lexical and articulatory structures and pathways driving phonetic activation in the posterior superior temporal gyrus in assimilation conditions, but not in phonetically unambiguous conditions. These results lend strong support for the motor theory of speech perception, and clarify the role of lexical mediation in the phonetic processing of assimilated speech.
Voncken, Marisol J; Bögels, Susan M
Cognitive models emphasize that patients with social anxiety disorder (SAD) are mainly characterized by biased perception of their social performance. In addition, there is a growing body of evidence showing that SAD patients suffer from actual deficits in social interaction. To unravel what characterizes SAD patients the most, underestimation of social performance (defined as the discrepancy between self-perceived and observer-perceived social performance), or actual (observer-perceived) social performance, 48 patients with SAD and 27 normal control participants were observed during a speech and conversation. Consistent with the cognitive model of SAD, patients with SAD underestimated their social performance relative to control participants during the two interactions, but primarily during the speech. Actual social performance deficits were clearly apparent in the conversation but not in the speech. In conclusion, interactions that pull for more interpersonal skills, like a conversation, elicit more actual social performance deficits whereas, situations with a performance character, like a speech, bring about more cognitive distortions in patients with SAD.
Sumner, Meghan; Kim, Seung Kyung; King, Ed; McGowan, Kevin B
Spoken words are highly variable. A single word may never be uttered the same way twice. As listeners, we regularly encounter speakers of different ages, genders, and accents, increasing the amount of variation we face. How listeners understand spoken words as quickly and adeptly as they do despite this variation remains an issue central to linguistic theory. We propose that learned acoustic patterns are mapped simultaneously to linguistic representations and to social representations. In doing so, we illuminate a paradox that results in the literature from, we argue, the focus on representations and the peripheral treatment of word-level phonetic variation. We consider phonetic variation more fully and highlight a growing body of work that is problematic for current theory: words with different pronunciation variants are recognized equally well in immediate processing tasks, while an atypical, infrequent, but socially idealized form is remembered better in the long-term. We suggest that the perception of spoken words is socially weighted, resulting in sparse, but high-resolution clusters of socially idealized episodes that are robust in immediate processing and are more strongly encoded, predicting memory inequality. Our proposal includes a dual-route approach to speech perception in which listeners map acoustic patterns in speech to linguistic and social representations in tandem. This approach makes novel predictions about the extraction of information from the speech signal, and provides a framework with which we can ask new questions. We propose that language comprehension, broadly, results from the integration of both linguistic and social information.
Trude, Alison M; Duff, Melissa C; Brown-Schmidt, Sarah
A hallmark of human speech perception is the ability to comprehend speech quickly and effortlessly despite enormous variability across talkers. However, current theories of speech perception do not make specific claims about the memory mechanisms involved in this process. To examine whether declarative memory is necessary for talker-specific learning, we tested the ability of amnesic patients with severe declarative memory deficits to learn and distinguish the accents of two unfamiliar talkers by monitoring their eye-gaze as they followed spoken instructions. Analyses of the time-course of eye fixations showed that amnesic patients rapidly learned to distinguish these accents and tailored perceptual processes to the voice of each talker. These results demonstrate that declarative memory is not necessary for this ability and points to the involvement of non-declarative memory mechanisms. These results are consistent with findings that other social and accommodative behaviors are preserved in amnesia and contribute to our understanding of the interactions of multiple memory systems in the use and understanding of spoken language. Copyright © 2014 Elsevier Ltd. All rights reserved.
Cai, Ting; McPherson, Bradley; Li, Caiwei; Yang, Feng
Conductive hearing loss simulations have attempted to estimate the speech-understanding difficulties of children with otitis media with effusion (OME). However, the validity of this approach has not been evaluated. The research aim of the present study was to investigate whether a simple, frequency-specific, attenuation-based simulation of OME-related hearing loss was able to reflect the actual effects of conductive hearing loss on speech perception. Forty-one school-age children with OME-related hearing loss were recruited. Each child with OME was matched with a same sex and age counterpart with normal hearing to make a participant pair. Pure-tone threshold differences at octave frequencies from 125 to 8000 Hz for every participant pair were used as the simulation attenuation levels for the normal-hearing children. Another group of 41 school-age otologically normal children were recruited as a control group without actual or simulated hearing loss. The Mandarin Hearing in Noise Test was utilized, and sentence recall accuracy at four signal to noise ratios (SNR) considered representative of classroom-listening conditions were derived, as well as reception thresholds for sentences (RTS) in quiet and in noise using adaptive protocols. The speech perception in quiet and in noise of children with simulated OME-related hearing loss was significantly poorer than that of otologically normal children. Analysis showed that RTS in quiet of children with OME-related hearing loss and of children with simulated OME-related hearing loss was significantly correlated and comparable. A repeated-measures analysis suggested that sentence recall accuracy obtained at 5-dB SNR, 0-dB SNR, and -5-dB SNR was similar between children with actual and simulated OME-related hearing loss. However, RTS in noise in children with OME was significantly better than that for children with simulated OME-related hearing loss. The present frequency-specific, attenuation-based simulation method reflected
Millman, Rebecca E; Mattys, Sven L; Gouws, André D; Prendergast, Garreth
Verbal communication in noisy backgrounds is challenging. Understanding speech in background noise that fluctuates in intensity over time is particularly difficult for hearing-impaired listeners with a sensorineural hearing loss (SNHL). The reduction in fast-acting cochlear compression associated with SNHL exaggerates the perceived fluctuations in intensity in amplitude-modulated sounds. SNHL-induced changes in the coding of amplitude-modulated sounds may have a detrimental effect on the ability of SNHL listeners to understand speech in the presence of modulated background noise. To date, direct evidence for a link between magnified envelope coding and deficits in speech identification in modulated noise has been absent. Here, magnetoencephalography was used to quantify the effects of SNHL on phase locking to the temporal envelope of modulated noise (envelope coding) in human auditory cortex. Our results show that SNHL enhances the amplitude of envelope coding in posteromedial auditory cortex, whereas it enhances the fidelity of envelope coding in posteromedial and posterolateral auditory cortex. This dissociation was more evident in the right hemisphere, demonstrating functional lateralization in enhanced envelope coding in SNHL listeners. However, enhanced envelope coding was not perceptually beneficial. Our results also show that both hearing thresholds and, to a lesser extent, magnified cortical envelope coding in left posteromedial auditory cortex predict speech identification in modulated background noise. We propose a framework in which magnified envelope coding in posteromedial auditory cortex disrupts the segregation of speech from background noise, leading to deficits in speech perception in modulated background noise. SIGNIFICANCE STATEMENT People with hearing loss struggle to follow conversations in noisy environments. Background noise that fluctuates in intensity over time poses a particular challenge. Using magnetoencephalography, we demonstrate
Heffner, Christopher C.; Newman, Rochelle S.; Dilley, Laura C.; Idsardi, William J.
Purpose: A new literature has suggested that speech rate can influence the parsing of words quite strongly in speech. The purpose of this study was to investigate differences between younger adults and older adults in the use of context speech rate in word segmentation, given that older adults perceive timing information differently from younger…
Gauvin, Hanna; De Baene, W.; Brass, Marcel; Hartsuiker, Robert
To minimize the number of errors in speech, and thereby facilitate communication, speech is monitored before articulation. It is, however, unclear at which level during speech production monitoring takes place, and what mechanisms are used to detect and correct errors. The present study investigated
Davidson, Lisa S; Skinner, Margaret W; Holstad, Beth A; Fears, Beverly T; Richter, Marie K; Matusofsky, Margaret; Brenner, Christine; Holden, Timothy; Birath, Amy; Kettel, Jerrica L; Scollie, Susan
The purpose of this study was to examine the effects of a wider instantaneous input dynamic range (IIDR) setting on speech perception and comfort in quiet and noise for children wearing the Nucleus 24 implant system and the Freedom speech processor. In addition, children's ability to understand soft and conversational level speech in relation to aided sound-field thresholds was examined. Thirty children (age, 7 to 17 years) with the Nucleus 24 cochlear implant system and the Freedom speech processor with two different IIDR settings (30 versus 40 dB) were tested on the Consonant Nucleus Consonant (CNC) word test at 50 and 60 dB SPL, the Bamford-Kowal-Bench Speech in Noise Test, and a loudness rating task for four-talker speech noise. Aided thresholds for frequency-modulated tones, narrowband noise, and recorded Ling sounds were obtained with the two IIDRs and examined in relation to CNC scores at 50 dB SPL. Speech Intelligibility Indices were calculated using the long-term average speech spectrum of the CNC words at 50 dB SPL measured at each test site and aided thresholds. Group mean CNC scores at 50 dB SPL with the 40 IIDR were significantly higher (p Speech in Noise Test were not significantly different for the two IIDRs. Significantly improved aided thresholds at 250 to 6000 Hz as well as higher Speech Intelligibility Indices afforded improved audibility for speech presented at soft levels (50 dB SPL). These results indicate that an increased IIDR provides improved word recognition for soft levels of speech without compromising comfort of higher levels of speech sounds or sentence recognition in noise.
Full Text Available Behavioral and neuroimaging studies have demonstrated that brain regions involved with speech production also support speech perception, especially under degraded conditions. The premotor cortex has been shown to be active during both observation and execution of action (‘Mirror System’ properties, and may facilitate speech perception by mapping unimodal and multimodal sensory features onto articulatory speech gestures. For this functional magnetic resonance imaging (fMRI study, participants identified vowels produced by a speaker in audio-visual (saw the speaker’s articulating face and heard her voice, visual only (only saw the speaker’s articulating face, and audio only (only heard the speaker’s voice conditions with varying audio signal-to-noise ratios in order to determine the regions of the premotor cortex involved with multisensory and modality specific processing of visual speech gestures. The task was designed so that identification could be made with a high level of accuracy from visual only stimuli to control for task difficulty and differences in intelligibility. The results of the fMRI analysis for visual only and audio-visual conditions showed overlapping activity in inferior frontal gyrus and premotor cortex. The left ventral inferior premotor cortex showed properties of multimodal (audio-visual enhancement with a degraded auditory signal. The left inferior parietal lobule and right cerebellum also showed these properties. The left ventral superior and dorsal premotor cortex did not show this multisensory enhancement effect, but there was greater activity for the visual only over audio-visual conditions in these areas. The results suggest that the inferior regions of the ventral premotor cortex are involved with integrating multisensory information, whereas, more superior and dorsal regions of the premotor cortex are involved with mapping unimodal (in this case visual sensory features of the speech signal with
Rader, Tobias; Adel, Youssef; Fastl, Hugo; Baumann, Uwe
The aim of this study is to simulate speech perception with combined electric-acoustic stimulation (EAS), verify the advantage of combined stimulation in normal-hearing (NH) subjects, and then compare it with cochlear implant (CI) and EAS user results from the authors' previous study. Furthermore, an automatic speech recognition (ASR) system was built to examine the impact of low-frequency information and is proposed as an applied model to study different hypotheses of the combined-stimulation advantage. Signal-detection-theory (SDT) models were applied to assess predictions of subject performance without the need to assume any synergistic effects. Speech perception was tested using a closed-set matrix test (Oldenburg sentence test), and its speech material was processed to simulate CI and EAS hearing. A total of 43 NH subjects and a customized ASR system were tested. CI hearing was simulated by an aurally adequate signal spectrum analysis and representation, the part-tone-time-pattern, which was vocoded at 12 center frequencies according to the MED-EL DUET speech processor. Residual acoustic hearing was simulated by low-pass (LP)-filtered speech with cutoff frequencies 200 and 500 Hz for NH subjects and in the range from 100 to 500 Hz for the ASR system. Speech reception thresholds were determined in amplitude-modulated noise and in pseudocontinuous noise. Previously proposed SDT models were lastly applied to predict NH subject performance with EAS simulations. NH subjects tested with EAS simulations demonstrated the combined-stimulation advantage. Increasing the LP cutoff frequency from 200 to 500 Hz significantly improved speech reception thresholds in both noise conditions. In continuous noise, CI and EAS users showed generally better performance than NH subjects tested with simulations. In modulated noise, performance was comparable except for the EAS at cutoff frequency 500 Hz where NH subject performance was superior. The ASR system showed similar behavior
Le Cocq, Cecile
Workers in noisy industrial environments are often confronted to communication problems. Lost of workers complain about not being able to communicate easily with their coworkers when they wear hearing protectors. In consequence, they tend to remove their protectors, which expose them to the risk of hearing loss. In fact this communication problem is a double one: first the hearing protectors modify one's own voice perception; second they interfere with understanding speech from others. This double problem is examined in this thesis. When wearing hearing protectors, the modification of one's own voice perception is partly due to the occlusion effect which is produced when an earplug is inserted in the car canal. This occlusion effect has two main consequences: first the physiological noises in low frequencies are better perceived, second the perception of one's own voice is modified. In order to have a better understanding of this phenomenon, the literature results are analyzed systematically, and a new method to quantify the occlusion effect is developed. Instead of stimulating the skull with a bone vibrator or asking the subject to speak as is usually done in the literature, it has been decided to excite the buccal cavity with an acoustic wave. The experiment has been designed in such a way that the acoustic wave which excites the buccal cavity does not excite the external car or the rest of the body directly. The measurement of the hearing threshold in open and occluded car has been used to quantify the subjective occlusion effect for an acoustic wave in the buccal cavity. These experimental results as well as those reported in the literature have lead to a better understanding of the occlusion effect and an evaluation of the role of each internal path from the acoustic source to the internal car. The speech intelligibility from others is altered by both the high sound levels of noisy industrial environments and the speech signal attenuation due to hearing
Temporal proximity is one of the key factors determining whether events in different modalities are integrated into a unified percept. Sensitivity to audiovisual temporal asynchrony has been studied in adults in great detail. However, how such sensitivity matures during childhood is poorly understood. We examined perception of audiovisual temporal…
Massaro, Dominic W
I review 2 seminal research reports published in this journal during its second decade more than a century ago. Given psychology's subdisciplines, they would not normally be reviewed together because one involves reading and the other speech perception. The small amount of interaction between these domains might have limited research and theoretical progress. In fact, the 2 early research reports revealed common processes involved in these 2 forms of language processing. Their illustration of the role of Wundt's apperceptive process in reading and speech perception anticipated descriptions of contemporary theories of pattern recognition, such as the fuzzy logical model of perception. Based on the commonalities between reading and listening, one can question why they have been viewed so differently. It is commonly believed that learning to read requires formal instruction and schooling, whereas spoken language is acquired from birth onward through natural interactions with people who talk. Most researchers and educators believe that spoken language is acquired naturally from birth onward and even prenatally. Learning to read, on the other hand, is not possible until the child has acquired spoken language, reaches school age, and receives formal instruction. If an appropriate form of written text is made available early in a child's life, however, the current hypothesis is that reading will also be learned inductively and emerge naturally, with no significant negative consequences. If this proposal is true, it should soon be possible to create an interactive system, Technology Assisted Reading Acquisition, to allow children to acquire literacy naturally.
Gauvin, Hanna S.; Hartsuiker, Robert J.; Huettig, Falk
The Perceptual Loop Theory of speech monitoring assumes that speakers routinely inspect their inner speech. In contrast, Huettig and Hartsuiker (2010) observed that listening to one's own speech during language production drives eye-movements to phonologically related printed words with a similar time-course as listening to someone else's speech does in speech perception experiments. This suggests that speakers use their speech perception system to listen to their own overt speech, but not to their inner speech. However, a direct comparison between production and perception with the same stimuli and participants is lacking so far. The current printed word eye-tracking experiment therefore used a within-subjects design, combining production and perception. Displays showed four words, of which one, the target, either had to be named or was presented auditorily. Accompanying words were phonologically related, semantically related, or unrelated to the target. There were small increases in looks to phonological competitors with a similar time-course in both production and perception. Phonological effects in perception however lasted longer and had a much larger magnitude. We conjecture that this difference is related to a difference in predictability of one's own and someone else's speech, which in turn has consequences for lexical competition in other-perception and possibly suppression of activation in self-perception. PMID:24339809
Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun
Human emotions could be expressed by many bio-symbols. Speech and facial expression are two of them. They are both regarded as emotional information which is playing an important role in human-computer interaction. Based on our previous studies on emotion recognition, an audiovisual emotion recognition system is developed and represented in this paper. The system is designed for real-time practice, and is guaranteed by some integrated modules. These modules include speech enhancement for eliminating noises, rapid face detection for locating face from background image, example based shape learning for facial feature alignment, and optical flow based tracking algorithm for facial feature tracking. It is known that irrelevant features and high dimensionality of the data can hurt the performance of classifier. Rough set-based feature selection is a good method for dimension reduction. So 13 speech features out of 37 ones and 10 facial features out of 33 ones are selected to represent emotional information, and 52 audiovisual features are selected due to the synchronization when speech and video fused together. The experiment results have demonstrated that this system performs well in real-time practice and has high recognition rate. Our results also show that the work in multimodules fused recognition will become the trend of emotion recognition in the future.
Full Text Available Brain-computer interfaces (BCIs are systems that use real-time analysis of neuroimaging data to determine the mental state of their user for purposes such as providing neurofeedback. Here, we investigate the feasibility of a BCI based on speech perception. Multivariate pattern classification methods were applied to single-trial EEG data collected during speech perception by native and non-native speakers. Two principal questions were asked: 1 Can differences in the perceived categories of pairs of phonemes be decoded at the single-trial level? 2 Can these same categorical differences be decoded across participants, within or between native-language groups? Results indicated that classification performance progressively increased with respect to the categorical status (within, boundary or across of the stimulus contrast, and was also influenced by the native language of individual participants. Classifier performance showed strong relationships with traditional event-related potential measures and behavioral responses. The results of the cross-participant analysis indicated an overall increase in average classifier performance when trained on data from all participants (native and non-native. A second cross-participant classifier trained only on data from native speakers led to an overall improvement in performance for native speakers, but a reduction in performance for non-native speakers. We also found that the native language of a given participant could be decoded on the basis of EEG data with accuracy above 80%. These results indicate that electrophysiological responses underlying speech perception can be decoded at the single-trial level, and that decoding performance systematically reflects graded changes in the responses related to the phonological status of the stimuli. This approach could be used in extensions of the BCI paradigm to support perceptual learning during second language acquisition.
Scheperle, Rachel A; Abbas, Paul J
The ability to perceive speech is related to the listener's ability to differentiate among frequencies (i.e., spectral resolution). Cochlear implant (CI) users exhibit variable speech-perception and spectral-resolution abilities, which can be attributed in part to the extent of electrode interactions at the periphery (i.e., spatial selectivity). However, electrophysiological measures of peripheral spatial selectivity have not been found to correlate with speech perception. The purpose of this study was to evaluate auditory processing at the periphery and cortex using both simple and spectrally complex stimuli to better understand the stages of neural processing underlying speech perception. The hypotheses were that (1) by more completely characterizing peripheral excitation patterns than in previous studies, significant correlations with measures of spectral selectivity and speech perception would be observed, (2) adding information about processing at a level central to the auditory nerve would account for additional variability in speech perception, and (3) responses elicited with spectrally complex stimuli would be more strongly correlated with speech perception than responses elicited with spectrally simple stimuli. Eleven adult CI users participated. Three experimental processor programs (MAPs) were created to vary the likelihood of electrode interactions within each participant. For each MAP, a subset of 7 of 22 intracochlear electrodes was activated: adjacent (MAP 1), every other (MAP 2), or every third (MAP 3). Peripheral spatial selectivity was assessed using the electrically evoked compound action potential (ECAP) to obtain channel-interaction functions for all activated electrodes (13 functions total). Central processing was assessed by eliciting the auditory change complex with both spatial (electrode pairs) and spectral (rippled noise) stimulus changes. Speech-perception measures included vowel discrimination and the Bamford-Kowal-Bench Speech
Petridis, Stavros; Nijholt, Antinus; Nijholt, A.; Pantic, M.; Pantic, Maja; Poel, Mannes; Poel, M.; Hondorp, G.H.W.
Previous research on automatic laughter detection has mainly been focused on audio-based detection. In this study we present an audiovisual approach to distinguishing laughter from speech based on temporal features and we show that the integration of audio and visual information leads to improved
Borrie, Stephanie A.; Lansford, Kaitlin L.; Barrett, Tyson S.
Purpose: The perception of rhythm cues plays an important role in recognizing spoken language, especially in adverse listening conditions. Indeed, this has been shown to hold true even when the rhythm cues themselves are dysrhythmic. This study investigates whether expertise in rhythm perception provides a processing advantage for perception…
Dole, Marjorie; Hoen, Michel; Meunier, Fanny
Developmental dyslexia is associated with impaired speech-in-noise perception. The goal of the present research was to further characterize this deficit in dyslexic adults. In order to specify the mechanisms and processing strategies used by adults with dyslexia during speech-in-noise perception, we explored the influence of background type, presenting single target-words against backgrounds made of cocktail party sounds, modulated speech-derived noise or stationary noise. We also evaluated the effect of three listening configurations differing in terms of the amount of spatial processing required. In a monaural condition, signal and noise were presented to the same ear while in a dichotic situation, target and concurrent sound were presented to two different ears, finally in a spatialised configuration, target and competing signals were presented as if they originated from slightly differing positions in the auditory scene. Our results confirm the presence of a speech-in-noise perception deficit in dyslexic adults, in particular when the competing signal is also speech, and when both signals are presented to the same ear, an observation potentially relating to phonological accounts of dyslexia. However, adult dyslexics demonstrated better levels of spatial release of masking than normal reading controls when the background was speech, suggesting that they are well able to rely on denoising strategies based on spatial auditory scene analysis strategies. Copyright © 2012 Elsevier Ltd. All rights reserved.
Heather Raye Dial
When the lexical and sublexical stimuli were matched in discriminability, scores were highly correlated and no individual demonstrated substantially better performance on lexical than sublexical perception (Figures 1a-c. However, when the word discriminations were easier (as in prior studies; e.g., Miceli et al., 1980, patients with impaired syllable discrimination were within the control range on word discrimination (Figure 1d. Finally, digit matching showed no significant relation to perception tasks (e.g., Figure 1e. Moreover, there was a wide range of digit matching spans for patients performing well on speech perception tasks (e.g., > 1.5 on syllable discrimination and digit matching ranging from 3.6 to 6.0. These data fail to support dual route claims, suggesting that lexical processing depends on sublexical perception and suggesting that phonological STM depends on a buffer separate from speech perception mechanisms.
This paper investigates similarities between lexical consonant clusters and CVC sequences differing in the presence or absence of a lexical vowel in speech perception and production in two Portuguese varieties. The frequent high vowel deletion in the European variety (EP) and the realization of intervening vocalic elements between lexical clusters in Brazilian Portuguese (BP) may minimize the contrast between lexical clusters and CVC sequences in the two Portuguese varieties. In order to test this hypothesis we present a perception experiment with 72 participants and a physiological analysis of 3-dimensional movement data from 5 EP and 4 BP speakers. The perceptual results confirmed a gradual confusion of lexical clusters and CVC sequences in EP, which corresponded roughly to the gradient consonantal overlap found in production. © 2015 S. Karger AG, Basel.
Full Text Available Recent models on speech perception propose a dual stream processing network, with a dorsal stream, extending from the posterior temporal lobe of the left hemisphere through inferior parietal areas into the left inferior frontal gyrus, and a ventral stream that is assumed to originate in the primary auditory cortex in the upper posterior part of the temporal lobe and to extend towards the anterior part of the temporal lobe, where it may connect to the ventral part of the inferior frontal gyrus. This article describes and reviews the results from a series of complementary functional magnetic imaging (fMRI studies that aimed to trace the hierarchical processing network for speech comprehension within the left and right hemisphere with a particular focus on the temporal lobe and the ventral stream. As hypothesised, the results demonstrate a bilateral involvement of the temporal lobes in the processing of speech signals. However, an increasing leftward asymmetry was detected from auditory-phonetic to lexico-semantic processing and along the posterior-anterior axis, thus forming a lateralisation gradient. This increasing leftward lateralisation was particularly evident for the left superior temporal sulcus (STS and more anterior parts of the temporal lobe.
Recent models on speech perception propose a dual-stream processing network, with a dorsal stream, extending from the posterior temporal lobe of the left hemisphere through inferior parietal areas into the left inferior frontal gyrus, and a ventral stream that is assumed to originate in the primary auditory cortex in the upper posterior part of the temporal lobe and to extend toward the anterior part of the temporal lobe, where it may connect to the ventral part of the inferior frontal gyrus. This article describes and reviews the results from a series of complementary functional magnetic resonance imaging studies that aimed to trace the hierarchical processing network for speech comprehension within the left and right hemisphere with a particular focus on the temporal lobe and the ventral stream. As hypothesized, the results demonstrate a bilateral involvement of the temporal lobes in the processing of speech signals. However, an increasing leftward asymmetry was detected from auditory-phonetic to lexico-semantic processing and along the posterior-anterior axis, thus forming a "lateralization" gradient. This increasing leftward lateralization was particularly evident for the left superior temporal sulcus and more anterior parts of the temporal lobe.
The work presented in this book focuses on modeling audiovisual quality as perceived by the users of IP-based solutions for video communication like videotelephony. It also extends the current framework for the parametric prediction of audiovisual call quality. The book addresses several aspects related to the quality perception of entire video calls, namely, the quality estimation of the single audio and video modalities in an interactive context, the audiovisual quality integration of these modalities and the temporal pooling of short sample-based quality scores to account for the perceptual quality impact of time-varying degradations.
Higgins, Meaghan C.; Penney, Sarah B.; Robertson, Erin K.
The roles of phonological short-term memory (pSTM) and speech perception in spoken sentence comprehension were examined in an experimental design. Deficits in pSTM and speech perception were simulated through task demands while typically-developing children (N = 71) completed a sentence-picture matching task. Children performed the control,…
Erdener, Dogu; Burnham, Denis
Despite the body of research on auditory-visual speech perception in infants and schoolchildren, development in the early childhood period remains relatively uncharted. In this study, English-speaking children between three and four years of age were investigated for: (i) the development of visual speech perception--lip-reading and visual…
Huettig, Falk; Hartsuiker, Robert J.
Theories of verbal self-monitoring generally assume an internal (pre-articulatory) monitoring channel, but there is debate about whether this channel relies on speech perception or on production-internal mechanisms. Perception-based theories predict that listening to one's own inner speech has similar behavioural consequences as listening to…
Rachovitsas, Dimitrios; Psillas, George; Chatzigiannakidou, Vasiliki; Triaridis, Stefanos; Constantinidis, Jiannis; Vital, Victor
The aim of this study was to assess the speech perception and speech intelligibility outcome after cochlear implantation in children with malformed inner ear and to compare them with a group of congenitally deaf children implantees without inner ear malformation. Six deaf children (five boys and one girl) with inner ear malformations who were implanted and followed in our clinic were included. These children were matched with six implanted children with normal cochlea for age at implantation and duration of cochlear implant use. All subjects were tested with the internationally used battery tests of listening progress profile (LiP), capacity of auditory performance (CAP), and speech intelligibility rating (SIR). A closed and open set word perception test adapted to the Modern Greek language was also used. In the dysplastic group, two children suffered from CHARGE syndrome, another two from mental retardation, and two children grew up in bilingual homes. At least two years after switch-on, the dysplastic group scored mean LiP 62%, CAP 3.8, SIR 2.1, closed-set 61%, and open-set 49%. The children without inner ear dysplasia achieved significantly better scores, except for CAP which this difference was marginally statistically significant (p=0.009 for LiP, p=0.080 for CAP, p=0.041 for SIR, p=0.011 for closed-set, and p=0.006 for open-set tests). All of the implanted children with malformed inner ear showed benefit of auditory perception and speech production. However, the children with inner ear malformation performed less well compared with the children without inner ear dysplasia. This was possibly due to the high proportion of disabilities detected in the dysplastic group, such as CHARGE syndrome and mental retardation. Bilingualism could also be considered as a factor which possibly affects the outcome of implanted children. Therefore, children with malformed inner ear should be preoperatively evaluated for cognitive and developmental delay. In this case
Wang, Hsiao-Lan S; Chen, I-Chen; Chiang, Chun-Han; Lai, Ying-Hui; Tsao, Yu
The current study examined the associations between basic auditory perception, speech prosodic processing, and vocabulary development in Chinese kindergartners, specifically, whether early basic auditory perception may be related to linguistic prosodic processing in Chinese Mandarin vocabulary acquisition. A series of language, auditory, and linguistic prosodic tests were given to 100 preschool children who had not yet learned how to read Chinese characters. The results suggested that lexical tone sensitivity and intonation production were significantly correlated with children's general vocabulary abilities. In particular, tone awareness was associated with comprehensive language development, whereas intonation production was associated with both comprehensive and expressive language development. Regression analyses revealed that tone sensitivity accounted for 36% of the unique variance in vocabulary development, whereas intonation production accounted for 6% of the variance in vocabulary development. Moreover, auditory frequency discrimination was significantly correlated with lexical tone sensitivity, syllable duration discrimination, and intonation production in Mandarin Chinese. Also it provided significant contributions to tone sensitivity and intonation production. Auditory frequency discrimination may indirectly affect early vocabulary development through Chinese speech prosody. © The Author(s) 2016.
Julie A. Bierer
Full Text Available Speech perception among cochlear implant (CI listeners is highly variable. High degrees of channel interaction are associated with poorer speech understanding. Two methods for reducing channel interaction, focusing electrical fields, and deactivating subsets of channels were assessed by the change in vowel and consonant identification scores with different program settings. The main hypotheses were that (a focused stimulation will improve phoneme recognition and (b speech perception will improve when channels with high thresholds are deactivated. To select high-threshold channels for deactivation, subjects’ threshold profiles were processed to enhance the peaks and troughs, and then an exclusion or inclusion criterion based on the mean and standard deviation was used. Low-threshold channels were selected manually and matched in number and apex-to-base distribution. Nine ears in eight adult CI listeners with Advanced Bionics HiRes90k devices were tested with six experimental programs. Two, all-channel programs, (a 14-channel partial tripolar (pTP and (b 14-channel monopolar (MP, and four variable-channel programs, derived from these two base programs, (c pTP with high- and (d low-threshold channels deactivated, and (e MP with high- and (f low-threshold channels deactivated, were created. Across subjects, performance was similar with pTP and MP programs. However, poorer performing subjects (scoring 2. These same subjects showed slightly more benefit with the reduced channel MP programs (5 and 6. Subjective ratings were consistent with performance. These finding suggest that reducing channel interaction may benefit poorer performing CI listeners.
Magimairaj, Beula M; Nagaraj, Naveen K; Benafield, Natalie J
We examined the association between speech perception in noise (SPIN), language abilities, and working memory (WM) capacity in school-age children. Existing studies supporting the Ease of Language Understanding (ELU) model suggest that WM capacity plays a significant role in adverse listening situations. Eighty-three children between the ages of 7 to 11 years participated. The sample represented a continuum of individual differences in attention, memory, and language abilities. All children had normal-range hearing and normal-range nonverbal IQ. Children completed the Bamford-Kowal-Bench Speech-in-Noise Test (BKB-SIN; Etymotic Research, 2005), a selective auditory attention task, and multiple measures of language and WM. Partial correlations (controlling for age) showed significant positive associations among attention, memory, and language measures. However, BKB-SIN did not correlate significantly with any of the other measures. Principal component analysis revealed a distinct WM factor and a distinct language factor. BKB-SIN loaded robustly as a distinct 3rd factor with minimal secondary loading from sentence recall and short-term memory. Nonverbal IQ loaded as a 4th factor. Results did not support an association between SPIN and WM capacity in children. However, in this study, a single SPIN measure was used. Future studies using multiple SPIN measures are warranted. Evidence from the current study supports the use of BKB-SIN as clinical measure of speech perception ability because it was not influenced by variation in children's language and memory abilities. More large-scale studies in school-age children are needed to replicate the proposed role played by WM in adverse listening situations.
Full Text Available Spoken words are highly variable. A single word may never be uttered the same way twice. As listeners, we regularly encounter speakers of different ages, genders, and accents, increasing the amount of variation we face. How listeners understand spoken words as quickly and adeptly as they do despite this variation remains an issue central to linguistic theory. We propose that learned acoustic patterns are mapped simultaneously to linguistic representations and to social representations. In doing so, we illuminate a paradox that results in the literature from, we argue, the focus on representations and the peripheral treatment of word-level phonetic variation. We consider phonetic variation more fully and highlight a growing body of work that is problematic for current theory: Words with different pronunciation variants are recognized equally well in immediate processing tasks, while an atypical, infrequent, but socially-idealized form is remembered better in the long-term. We suggest that the perception of spoken words is socially-weighted, resulting in sparse, but high-resolution clusters of socially-idealized episodes that are robust in immediate processing and are more strongly encoded, predicting memory inequality. Our proposal includes a dual-route approach to speech perception in which listeners map acoustic patterns in speech to linguistic and social representations in tandem. This approach makes novel predictions about the extraction of information from the speech signal, and provides a framework with which we can ask new questions. We propose that language comprehension, broadly, results from the integration of both linguistic and social information.
Dick, Anthony Steven; Goldin-Meadow, Susan; Hasson, Uri; Skipper, Jeremy I; Small, Steven L
Everyday communication is accompanied by visual information from several sources, including co-speech gestures, which provide semantic information listeners use to help disambiguate the speaker's message. Using fMRI, we examined how gestures influence neural activity in brain regions associated with processing semantic information. The BOLD response was recorded while participants listened to stories under three audiovisual conditions and one auditory-only (speech alone) condition. In the first audiovisual condition, the storyteller produced gestures that naturally accompany speech. In the second, the storyteller made semantically unrelated hand movements. In the third, the storyteller kept her hands still. In addition to inferior parietal and posterior superior and middle temporal regions, bilateral posterior superior temporal sulcus and left anterior inferior frontal gyrus responded more strongly to speech when it was further accompanied by gesture, regardless of the semantic relation to speech. However, the right inferior frontal gyrus was sensitive to the semantic import of the hand movements, demonstrating more activity when hand movements were semantically unrelated to the accompanying speech. These findings show that perceiving hand movements during speech modulates the distributed pattern of neural activation involved in both biological motion perception and discourse comprehension, suggesting listeners attempt to find meaning, not only in the words speakers produce, but also in the hand movements that accompany speech.
Donaldson, Gail S; Dawson, Patricia K; Borden, Lamar Z
Previous studies have confirmed that current steering can increase the number of discriminable pitches available to many cochlear implant (CI) users; however, the ability to perceive additional pitches has not been linked to improved speech perception. The primary goals of this study were to determine (1) whether adult CI users can achieve higher levels of spectral cue transmission with a speech processing strategy that implements current steering (Fidelity120) than with a predecessor strategy (HiRes) and, if so, (2) whether the magnitude of improvement can be predicted from individual differences in place-pitch sensitivity. A secondary goal was to determine whether Fidelity120 supports higher levels of speech recognition in noise than HiRes. A within-subjects repeated measures design evaluated speech perception performance with Fidelity120 relative to HiRes in 10 adult CI users. Subjects used the novel strategy (either HiRes or Fidelity120) for 8 wks during the main study; a subset of five subjects used Fidelity120 for three additional months after the main study. Speech perception was assessed for the spectral cues related to vowel F1 frequency, vowel F2 frequency, and consonant place of articulation; overall transmitted information for vowels and consonants; and sentence recognition in noise. Place-pitch sensitivity was measured for electrode pairs in the apical, middle, and basal regions of the implanted array using a psychophysical pitch-ranking task. With one exception, there was no effect of strategy (HiRes versus Fidelity120) on the speech measures tested, either during the main study (N = 10) or after extended use of Fidelity120 (N = 5). The exception was a small but significant advantage for HiRes over Fidelity120 for consonant perception during the main study. Examination of individual subjects' data revealed that 3 of 10 subjects demonstrated improved perception of one or more spectral cues with Fidelity120 relative to HiRes after 8 wks or longer
Full Text Available Foreign-accented speech often presents a challenging listening condition. In addition to deviations from the target speech norms related to the inexperience of the nonnative speaker, listener characteristics may play a role in determining intelligibility levels. We have previously shown that an implicit visual bias for associating East Asian faces and foreignness predicts the listeners’ perceptual ability to process Korean-accented English audiovisual speech (Yi et al., 2013. Here, we examine the neural mechanism underlying the influence of listener bias to foreign faces on speech perception. In a functional magnetic resonance imaging (fMRI study, native English speakers listened to native- and Korean-accented English sentences, with or without faces. The participants’ Asian-foreign association was measured using an implicit association test (IAT, conducted outside the scanner. We found that foreign-accented speech evoked greater activity in the bilateral primary auditory cortices and the inferior frontal gyri, potentially reflecting greater computational demand. Higher IAT scores, indicating greater bias, were associated with increased BOLD response to foreign-accented speech with faces in the primary auditory cortex, the early node for spectrotemporal analysis. We conclude the following: (1 foreign-accented speech perception places greater demand on the neural systems underlying speech perception; (2 face of the talker can exaggerate the perceived foreignness of foreign-accented speech; (3 implicit Asian-foreign association is associated with decreased neural efficiency in early spectrotemporal processing.
Julio Montero Díaz
Full Text Available This article analyzes the possibilities of presenting an audiovisual history in a society in which audiovisual media has progressively gained greater protagonism. We analyze specific cases of films and historical documentaries and we assess the difficulties faced by historians to understand the keys of audiovisual language and by filmmakers to understand and incorporate history into their productions. We conclude that it would not be possible to disseminate history in the western world without audiovisual resources circulated through various types of screens (cinema, television, computer, mobile phone, video games.
Klatte, Maria; Lachmann, Thomas; Meis, Markus
The effects of classroom noise and background speech on speech perception, measured by word-to-picture matching, and listening comprehension, measured by execution of oral instructions, were assessed in first- and third-grade children and adults in a classroom-like setting. For speech perception, in addition to noise, reverberation time (RT) was varied by conducting the experiment in two virtual classrooms with mean RT = 0.47 versus RT = 1.1 s. Children were more impaired than adults by background sounds in both speech perception and listening comprehension. Classroom noise evoked a reliable disruption in children's speech perception even under conditions of short reverberation. RT had no effect on speech perception in silence, but evoked a severe increase in the impairments due to background sounds in all age groups. For listening comprehension, impairments due to background sounds were found in the children, stronger for first- than for third-graders, whereas adults were unaffected. Compared to classroom noise, background speech had a smaller effect on speech perception, but a stronger effect on listening comprehension, remaining significant when speech perception was controlled. This indicates that background speech affects higher-order cognitive processes involved in children's comprehension. Children's ratings of the sound-induced disturbance were low overall and uncorrelated to the actual disruption, indicating that the children did not consciously realize the detrimental effects. The present results confirm earlier findings on the substantial impact of noise and reverberation on children's speech perception, and extend these to classroom-like environmental settings and listening demands closely resembling those faced by children at school.
Full Text Available Recent studies suggest that multisensory integration is enhanced in older adults but it is not known whether this enhancement is solely driven by perceptual processes or affected by cognitive processes. Using the ‘McGurk illusion’, in Experiment 1 we found that audio-visual integration of incongruent audio-visual words was higher in older adults than in younger adults, although the recognition of either audio- or visual-only presented words was the same across groups. In Experiment 2 we tested recall of sentences within which an incongruent audio-visual speech word was embedded. The overall semantic meaning of the sentence was compatible with either one of the unisensory components of the target word and/or with the illusory percept. Older participants recalled more illusory audio-visual words in sentences than younger adults, however, there was no differential effect of word compatibility on recall for the two groups. Our findings suggest that the relatively high susceptibility to the audio-visual speech illusion in older participants is due more to perceptual than cognitive processing.
Language development in infants born very preterm is often compromised. Poor language skills have been described in preschoolers and differences between preterms and full terms, relative to early vocabulary size and morphosyntactical complexity, have also been identified. However, very few data are available concerning early speech perception abilities and their predictive value for later language outcomes. An overview of the results obtained in a prospective study exploring the link between early speech perception abilities and lexical development in the second year of life in a population of very preterm infants (≤32 gestation weeks) is presented. Specifically, behavioral measures relative to (a) native-language recognition and discrimination from a rhythmically distant and a rhythmically close nonfamiliar languages, and (b) monosyllabic word-form segmentation, were obtained and compared to data from full-term infants. Expressive vocabulary at two test ages (12 and 18 months, corrected age for gestation) was measured using the MacArthur Communicative Development Inventory. Behavioral results indicated that differences between preterm and control groups were present, but only evident when task demands were high in terms of language processing, selective attention to relevant information and memory load. When responses could be based on acquired knowledge from accumulated linguistic experience, between-group differences were no longer observed. Critically, while preterm infants responded satisfactorily to the native-language recognition and discrimination tasks, they clearly differed from full-term infants in the more challenging activity of extracting and retaining word-form units from fluent speech, a fundamental ability for starting to building a lexicon. Correlations between results from the language discrimination tasks and expressive vocabulary measures could not be systematically established. However, attention time to novel words in the word segmentation
Locsei, Gusztav; Pedersen, Julie Hefting; Laugesen, Søren
This study investigated the relationship between speech perception performance in spatially complex, lateralized listening scenarios and temporal fine-structure (TFS) coding at low frequencies. Young normal-hearing (NH) and two groups of elderly hearing-impaired (HI) listeners with mild or moderate...... hearing loss above 1.5 kHz participated in the study. Speech reception thresholds (SRTs) were estimated in the presence of either speech-shaped noise, two-, four-, or eight-talker babble played reversed, or a nonreversed two-talker masker. Target audibility was ensured by applying individualized linear...... threshold nor the interaural phase difference threshold tasks showed a correlation with the SRTs or with the amount of masking release due to binaural unmasking, respectively. The results suggest that, although HI listeners with normal hearing thresholds below 1.5 kHz experienced difficulties with speech...
Locsei, Gusztav; Santurette, Sébastien; Dau, Torsten
and two-talker babble in terms of SRTs, HI listeners could utilize ITDs to a similar degree as NH listeners to facilitate the binaural unmasking of speech. A slight difference was observed between the group means when target and maskers were separated from each other by large ITDs, but not when separated...... SRMs are elicited by small ITDs. Speech reception thresholds (SRTs) and SRM due to ITDs were measured over headphones for 10 young NH and 10 older HI listeners, who had normal or close-to-normal hearing below 1.5 kHz. Diotic target sentences were presented in diotic or dichotic speech-shaped noise...... or two-talker babble maskers. In the dichotic conditions, maskers were lateralized by delaying the masker waveforms in the left headphone channel. Multiple magnitudes of masker ITDs were tested in both noise conditions. Although deficits were observed in speech perception abilities in speechshaped noise...
Lassen, N A; Friberg, L
by measuring regional cerebral blood flow CBF after intracarotid Xenon-133 injection are reviewed with emphasis on tests involving auditory perception and speech, and approach allowing to visualize Wernicke and Broca's areas and their contralateral homologues in vivo. The completely atraumatic tomographic CBF...
Scott, Sophie K.
Our understanding of the neurobiological basis for human speech production and perception has benefited from insights from psychology, neuropsychology and neurology. In this overview, I outline some of the ways that functional imaging has added to this knowledge and argue that, as a neuroanatomical tool, functional imaging has led to some…
Westerhausen, René; Bless, Josef J.; Passow, Susanne; Kompus, Kristiina; Hugdahl, Kenneth
The ability to use cognitive-control functions to regulate speech perception is thought to be crucial in mastering developmental challenges, such as language acquisition during childhood or compensation for sensory decline in older age, enabling interpersonal communication and meaningful social interactions throughout the entire life span.…
Most, Tova; Rothem, Hilla; Luntz, Michal
The researchers evaluated the contribution of cochlear implants (CIs) to speech perception by a sample of prelingually deaf individuals implanted after age 8 years. This group was compared with a group with profound hearing impairment (HA-P), and with a group with severe hearing impairment (HA-S), both of which used hearing aids. Words and…
Gou, J.; Smith, J.; Valero, J.; Rubio, I.
This paper reports on a clinical trial evaluating outcomes of a frequency-lowering technique for adolescents and young adults with severe to profound hearing impairment. Outcomes were defined by changes in aided thresholds, speech perception, and acceptance. The participants comprised seven young people aged between 13 and 25 years. They were…
Borrie, Stephanie A
This study investigated the influence of visual speech information on perceptual processing of neurologically degraded speech. Fifty listeners identified spastic dysarthric speech under both audio (A) and audiovisual (AV) conditions. Condition comparisons revealed that the addition of visual speech information enhanced processing of the neurologically degraded input in terms of (a) acuity (percent phonemes correct) of vowels and consonants and (b) recognition (percent words correct) of predictive and nonpredictive phrases. Listeners exploited stress-based segmentation strategies more readily in AV conditions, suggesting that the perceptual benefit associated with adding visual speech information to the auditory signal-the AV advantage-has both segmental and suprasegmental origins. Results also revealed that the magnitude of the AV advantage can be predicted, to some degree, by the extent to which an individual utilizes syllabic stress cues to inform word recognition in AV conditions. Findings inform the development of a listener-specific model of speech perception that applies to processing of dysarthric speech in everyday communication contexts.
Schall, Sonja; von Kriegstein, Katharina
It has been proposed that internal simulation of the talking face of visually-known speakers facilitates auditory speech recognition. One prediction of this view is that brain areas involved in auditory-only speech comprehension interact with visual face-movement sensitive areas, even under auditory-only listening conditions. Here, we test this hypothesis using connectivity analyses of functional magnetic resonance imaging (fMRI) data. Participants (17 normal participants, 17 developmental prosopagnosics) first learned six speakers via brief voice-face or voice-occupation training (comprehension. Overall, the present findings indicate that learned visual information is integrated into the analysis of auditory-only speech and that this integration results from the interaction of task-relevant face-movement and auditory speech-sensitive areas.
Maeda, Yukihide; Takao, Soshi; Sugaya, Akiko; Kataoka, Yuko; Kariya, Shin; Tanaka, Satomi; Nagayasu, Rie; Nakagawa, Atsuko; Nishizaki, Kazunori
To clarify how the pure-tone threshold (PTT) on the PTA predicts speech perception (SP) in elderly Japanese persons. Data on PTT and SP were cross-sectionally analyzed in Japanese persons (656 ears in 353 patients, aged ≥65 years). Correlations of SP and average PTT in all tested frequencies were evaluated by Pearson's correlation coefficient and simple linear regression. After adjusting for sex, laterality of ears, and age, the relationship of average and frequency-specific PTT with impaired SP ≤50% was estimated by logistic regression models. SP correlated well (r = -0.699) with the average PTT of all tested frequencies. On the other hand, the correlation between patient age and SP was weak, especially among ≤85-year-old persons (r = -0.092). Linear regression showed that the average PTT corresponding to SP of 50% was 76.4 dB nHL. Odds ratios for impaired SP were highest for PTT at 2000 Hz. Odds ratios were higher for middle (500, 1000, 2000 Hz) and high frequencies (4000, 8000 Hz) than low frequencies (125, 250 Hz). The PTT on the pure-tone audiogram (PTA) is a good predictor of SP by speech audiometry among older persons, which could provide clinically important information for hearing aid fitting and cochlear implantation.
Anderson, Elizabeth S; Nelson, David A; Kreft, Heather; Nelson, Peggy B; Oxenham, Andrew J
Spectral ripple discrimination thresholds were measured in 15 cochlear-implant users with broadband (350-5600 Hz) and octave-band noise stimuli. The results were compared with spatial tuning curve (STC) bandwidths previously obtained from the same subjects. Spatial tuning curve bandwidths did not correlate significantly with broadband spectral ripple discrimination thresholds but did correlate significantly with ripple discrimination thresholds when the rippled noise was confined to an octave-wide passband, centered on the STC's probe electrode frequency allocation. Ripple discrimination thresholds were also measured for octave-band stimuli in four contiguous octaves, with center frequencies from 500 Hz to 4000 Hz. Substantial variations in thresholds with center frequency were found in individuals, but no general trends of increasing or decreasing resolution from apex to base were observed in the pooled data. Neither ripple nor STC measures correlated consistently with speech measures in noise and quiet in the sample of subjects in this study. Overall, the results suggest that spectral ripple discrimination measures provide a reasonable measure of spectral resolution that correlates well with more direct, but more time-consuming, measures of spectral resolution, but that such measures do not always provide a clear and robust predictor of performance in speech perception tasks. © 2011 Acoustical Society of America
Tye-Murray, Nancy; Spehar, Brent P; Myerson, Joel; Hale, Sandra; Sommers, Mitchell S
Common-coding theory posits that (1) perceiving an action activates the same representations of motor plans that are activated by actually performing that action, and (2) because of individual differences in the ways that actions are performed, observing recordings of one's own previous behavior activates motor plans to an even greater degree than does observing someone else's behavior. We hypothesized that if observing oneself activates motor plans to a greater degree than does observing others, and if these activated plans contribute to perception, then people should be able to lipread silent video clips of their own previous utterances more accurately than they can lipread video clips of other talkers. As predicted, two groups of participants were able to lipread video clips of themselves, recorded more than two weeks earlier, significantly more accurately than video clips of others. These results suggest that visual input activates speech motor activity that links to word representations in the mental lexicon.
Higgins, Meaghan C; Penney, Sarah B; Robertson, Erin K
The roles of phonological short-term memory (pSTM) and speech perception in spoken sentence comprehension were examined in an experimental design. Deficits in pSTM and speech perception were simulated through task demands while typically-developing children (N [Formula: see text] 71) completed a sentence-picture matching task. Children performed the control, simulated pSTM deficit, simulated speech perception deficit, or simulated double deficit condition. On long sentences, the double deficit group had lower scores than the control and speech perception deficit groups, and the pSTM deficit group had lower scores than the control group and marginally lower scores than the speech perception deficit group. The pSTM and speech perception groups performed similarly to groups with real deficits in these areas, who completed the control condition. Overall, scores were lowest on noncanonical long sentences. Results show pSTM has a greater effect than speech perception on sentence comprehension, at least in the tasks employed here.
Approaches to music and audiovisual meaning in film appear to be very different in nature and scope when considered from the point of view of experimental psychology or humanistic studies. Nevertheless, this article argues that experimental studies square with ideas of audiovisual perception...... and meaning in humanistic film music studies in two ways: through studies of vertical synchronous interaction and through studies of horizontal narrative effects. Also, it is argued that the combination of insights from quantitative experimental studies and qualitative audiovisual film analysis may actually...... be combined into a more complex understanding of how audiovisual features interact in the minds of their audiences. This is demonstrated through a review of a series of experimental studies. Yet, it is also argued that textual analysis and concepts from within film and music studies can provide insights...
Isabel Fernandes Silva
Full Text Available Over the last decades, audiovisual translation has gained increased significance in Translation Studies as well as an interdisciplinary subject within other fields (media, cinema studies etc. Although many articles have been published on communicative aspects of translation such as politeness, only recently have scholars taken an interest in the translation of compliments. This study will focus on both these areas from a multimodal and pragmatic perspective, emphasizing the links between these fields and how this multidisciplinary approach will evidence the polysemiotic nature of the translation process. In Audiovisual Translation both text and image are at play, therefore, the translation of speech produced by the characters may either omit (because it is provided by visualgestual signs or it may emphasize information. A selection was made of the compliments present in the film What Women Want, our focus being on subtitles which did not successfully convey the compliment expressed in the source text, as well as analyze the reasons for this, namely difference in register, Culture Specific Items and repetitions. These differences lead to a different portrayal/identity/perception of the main character in the English version (original soundtrack and subtitled versions in Portuguese and Italian.
Full Text Available Activity in premotor and sensorimotor cortices is found in speech production and some perception tasks. Yet, how sensorimotor integration supports these functions is unclear due to a lack of data examining the timing of activity from these regions. Beta (~20Hz and alpha (~10Hz spectral power within the EEG µ rhythm are considered indices of motor and somatosensory activity, respectively. In the current study, perception conditions required discrimination (same/different of syllables pairs (/ba/ and /da/ in quiet and noisy conditions. Production conditions required covert and overt syllable productions and overt word production. Independent component analysis was performed on EEG data obtained during these conditions to 1 identify clusters of µ components common to all conditions and 2 examine real-time event-related spectral perturbations (ERSP within alpha and beta bands. 17 and 15 out of 20 participants produced left and right µ-components, respectively, localized to precentral gyri. Discrimination conditions were characterized by significant (pFDR<.05 early alpha event-related synchronization (ERS prior to and during stimulus presentation and later alpha event-related desynchronization (ERD following stimulus offset. Beta ERD began early and gained strength across time. Differences were found between quiet and noisy discrimination conditions. Both overt syllable and word productions yielded similar alpha/beta ERD that began prior to production and was strongest during muscle activity. Findings during covert production were weaker than during overt production. One explanation for these findings is that µ-beta ERD indexes early predictive coding (e.g., internal modeling and/or overt and covert attentional / motor processes. µ-alpha ERS may index inhibitory input to the premotor cortex from sensory regions prior to and during discrimination, while µ-alpha ERD may index re-afferent sensory feedback during speech rehearsal and production.
Jane, Griselda; Tunjungsari, Harini
Parental involvement in a speech therapy has not been prioritized in most therapy centers in Indonesia. One of the therapy centers that has recognized the importance of parental involvement is Kailila Speech Therapy Center. In Kailila speech therapy center, parental involvement in children's speech therapy is an obligation that has been…
Zaar, Johannes; Dau, Torsten
The present study investigated the influence of various sources of response variability in consonant perception. A distinction was made between sourceinduced variability and receiverrelated variability. The former refers to perceptual differences induced by differences in the ...
Döge, Julia; Baumann, Uwe; Weissgerber, Tobias; Rader, Tobias
To assess auditory localization accuracy and speech reception threshold (SRT) in complex noise conditions in adult patients with acquired single-sided deafness, after intervention with a cochlear implant (CI) in the deaf ear. Nonrandomized, open, prospective patient series. Tertiary referral university hospital. Eleven patients with late-onset single-sided deafness (SSD) and normal hearing in the unaffected ear, who received a CI. All patients were experienced CI users. Unilateral cochlear implantation. Speech perception was tested in a complex multitalker equivalent noise field consisting of multiple sound sources. Speech reception thresholds in noise were determined in aided (with CI) and unaided conditions. Localization accuracy was assessed in complete darkness. Acoustic stimuli were radiated by multiple loudspeakers distributed in the frontal horizontal plane between -60 and +60 degrees. In the aided condition, results show slightly improved speech reception scores compared with the unaided condition in most of the patients. For 8 of the 11 subjects, SRT was improved between 0.37 and 1.70 dB. Three of the 11 subjects showed deteriorations between 1.22 and 3.24 dB SRT. Median localization error decreased significantly by 12.9 degrees compared with the unaided condition. CI in single-sided deafness is an effective treatment to improve the auditory localization accuracy. Speech reception in complex noise conditions is improved to a lesser extent in 73% of the participating CI SSD patients. However, the absence of true binaural interaction effects (summation, squelch) impedes further improvements. The development of speech processing strategies that respect binaural interaction seems to be mandatory to advance speech perception in demanding listening situations in SSD patients.
Russo, Frank A.
The RAVDESS is a validated multimodal database of emotional speech and song. The database is gender balanced consisting of 24 professional actors, vocalizing lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in face-and-voice, face-only, and voice-only formats. The set of 7356 recordings were each rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals who were characteristic of untrained research participants from North America. A further set of 72 participants provided test-retest data. High levels of emotional validity and test-retest intrarater reliability were reported. Corrected accuracy and composite "goodness" measures are presented to assist researchers in the selection of stimuli. All recordings are made freely available under a Creative Commons license and can be downloaded at https://doi.org/10.5281/zenodo.1188976. PMID:29768426
William F Katz
Full Text Available Pronunciation training studies have yielded important information concerning the processing of audiovisual (AV information. Second language (L2 learners show increased reliance on bottom-up, multimodal input for speech perception (compared to monolingual individuals. However, little is known about the role of viewing one’s own speech articulation processes during speech training. The current study investigated whether real-time, visual feedback for tongue movement can improve a speaker’s learning of non-native speech sounds. An interactive 3D tongue visualization system based on electromagnetic articulography (EMA was used in a speech training experiment. Native speakers of American English produced a novel speech sound (/ɖ̠/; a voiced, coronal, palatal stop before, during, and after trials in which they viewed their own speech movements using the 3D model. Talkers’ productions were evaluated using kinematic (tongue-tip spatial positioning and acoustic (burst spectra measures. The results indicated a rapid gain in accuracy associated with visual feedback training. The findings are discussed with respect to neural models for multimodal speech processing.
Klatte, Maria; Meis, Markus; Sukowski, Helga; Schick, August
The effects of background noise of moderate intensity on short-term storage and processing of verbal information were analyzed in 6 to 8 year old children. In line with adult studies on "irrelevant sound effect" (ISE), serial recall of visually presented digits was severely disrupted by background speech that the children did not understand. Train noises of equal Intensity however, had no effect. Similar results were demonstrated with tasks requiring storage and processing of heard information. Memory for nonwords, execution of oral instructions and categorizing speech sounds were significantly disrupted by irrelevant speech. The affected functions play a fundamental role in the acquisition of spoken and written language. Implications concerning current models of the ISE and the acoustic conditions in schools and kindergardens are discussed.
Stern, Steven E; Chobany, Chelsea M; Beam, Alexander A; Hoover, Brittany N; Hull, Thomas T; Linsenbigler, Melissa; Makdad-Light, Courtney; Rubright, Courtney N
We have previously demonstrated that when speech generating devices (SGD) are used as assistive technologies, they are preferred over the users' natural voices. We sought to examine whether using SGDs would affect listener's perceptions of hirability of people with complex communication needs. In a series of three experiments, participants rated videotaped actors, one using SGD and the other using their natural, mildly dysarthric voice, on (a) a measurement of perceptions of speaker credibility, strength, and informedness and (b) measurements of hirability for jobs coded in terms of skill, verbal ability, and interactivity. Experiment 1 examined hirability for jobs varying in terms of skill and verbal ability. Experiment 2 was a replication that examined hirability for jobs varying in terms of interactivity. Experiment 3 examined jobs in terms of skill and specific mode of interaction (face-to-face, telephone, computer-mediated). Actors were rated more favorably when using SGD than their own voices. Actors using SGD were also rated more favorably for highly skilled and highly verbal jobs. This preference for SGDs over mildly dysarthric voice was also found for jobs entailing computer-mediated-communication, particularly skillful jobs.
Mary Kathryn Abel
Full Text Available Although pitch is a fundamental attribute of auditory perception, substantial individual differences exist in our ability to perceive differences in pitch. Little is known about how these individual differences in the auditory modality might affect crossmodal processes such as audiovisual perception. In this study, we asked whether individual differences in pitch perception might affect audiovisual perception, as it relates to age of onset and number of years of musical training. Fifty-seven subjects made subjective ratings of interval size when given point-light displays of audio, visual, and audiovisual stimuli of sung intervals. Audiovisual stimuli were divided into congruent and incongruent (audiovisual-mismatched stimuli. Participants' ratings correlated strongly with interval size in audio-only, visual-only, and audiovisual-congruent conditions. In the audiovisual-incongruent condition, ratings correlated more with audio than with visual stimuli, particularly for subjects who had better pitch perception abilities and higher nonverbal IQ scores. To further investigate the effects of age of onset and length of musical training, subjects were divided into musically trained and untrained groups. Results showed that among subjects with musical training, the degree to which participants' ratings correlated with auditory interval size during incongruent audiovisual perception was correlated with both nonverbal IQ and age of onset of musical training. After partialing out nonverbal IQ, pitch discrimination thresholds were no longer associated with incongruent audio scores, whereas age of onset of musical training remained associated with incongruent audio scores. These findings invite future research on the developmental effects of musical training, particularly those relating to the process of audiovisual perception.
Abel, Mary Kathryn; Li, H Charles; Russo, Frank A; Schlaug, Gottfried; Loui, Psyche
Although pitch is a fundamental attribute of auditory perception, substantial individual differences exist in our ability to perceive differences in pitch. Little is known about how these individual differences in the auditory modality might affect crossmodal processes such as audiovisual perception. In this study, we asked whether individual differences in pitch perception might affect audiovisual perception, as it relates to age of onset and number of years of musical training. Fifty-seven subjects made subjective ratings of interval size when given point-light displays of audio, visual, and audiovisual stimuli of sung intervals. Audiovisual stimuli were divided into congruent and incongruent (audiovisual-mismatched) stimuli. Participants' ratings correlated strongly with interval size in audio-only, visual-only, and audiovisual-congruent conditions. In the audiovisual-incongruent condition, ratings correlated more with audio than with visual stimuli, particularly for subjects who had better pitch perception abilities and higher nonverbal IQ scores. To further investigate the effects of age of onset and length of musical training, subjects were divided into musically trained and untrained groups. Results showed that among subjects with musical training, the degree to which participants' ratings correlated with auditory interval size during incongruent audiovisual perception was correlated with both nonverbal IQ and age of onset of musical training. After partialing out nonverbal IQ, pitch discrimination thresholds were no longer associated with incongruent audio scores, whereas age of onset of musical training remained associated with incongruent audio scores. These findings invite future research on the developmental effects of musical training, particularly those relating to the process of audiovisual perception.
Emily B. J. Coffey
Full Text Available Speech-in-noise (SIN perception is a complex cognitive skill that affects social, vocational, and educational activities. Poor SIN ability particularly affects young and elderly populations, yet varies considerably even among healthy young adults with normal hearing. Although SIN skills are known to be influenced by top-down processes that can selectively enhance lower-level sound representations, the complementary role of feed-forward mechanisms and their relationship to musical training is poorly understood. Using a paradigm that minimizes the main top-down factors that have been implicated in SIN performance such as working memory, we aimed to better understand how robust encoding of periodicity in the auditory system (as measured by the frequency-following response contributes to SIN perception. Using magnetoencephalograpy, we found that the strength of encoding at the fundamental frequency in the brainstem, thalamus, and cortex is correlated with SIN accuracy. The amplitude of the slower cortical P2 wave was previously also shown to be related to SIN accuracy and FFR strength; we use MEG source localization to show that the P2 wave originates in a temporal region anterior to that of the cortical FFR. We also confirm that the observed enhancements were related to the extent and timing of musicianship. These results are consistent with the hypothesis that basic feed-forward sound encoding affects SIN perception by providing better information to later processing stages, and that modifying this process may be one mechanism through which musical training might enhance the auditory networks that subserve both musical and language functions.
Yu, Jyaehyoung; Jeon, Hanjae; Song, Changgeun; Han, Woojae
The goal of the present study was to develop an auditory training program using a mobile device and to test its efficacy by applying it to older adults suffering from moderate-to-severe sensorineural hearing loss. Among the 20 elderly hearing-impaired listeners who participated, 10 were randomly assigned to a training group (TG) and 10 were assigned to a non-training group (NTG) as a control. As a baseline, all participants were measured by vowel, consonant and sentence tests. In the experiment, the TG had been trained for 4 weeks using a mobile program, which had four levels and consisted of 10 Korean nonsense syllables, with each level completed in 1 week. In contrast, traditional auditory training had been provided for the NTG during the same period. To evaluate whether a training effect was achieved, the two groups also carried out the same tests as the baseline after completing the experiment. The results showed that performance on the consonant and sentence tests in the TG was significantly increased compared with that of the NTG. Also, improved scores of speech perception were retained at 2 weeks after the training was completed. However, vowel scores were not changed after the 4-week training in both the TG and the NTG. This result pattern suggests that a moderate amount of auditory training using the mobile device with cost-effective and minimal supervision is useful when it is used to improve the speech understanding of older adults with hearing loss. Geriatr Gerontol Int 2017; 17: 61-68. © 2015 Japan Geriatrics Society.
Yeend, Ingrid; Beach, Elizabeth Francis; Sharma, Mridula; Dillon, Harvey
Recent animal research has shown that exposure to single episodes of intense noise causes cochlear synaptopathy without affecting hearing thresholds. It has been suggested that the same may occur in humans. If so, it is hypothesized that this would result in impaired encoding of sound and lead to difficulties hearing at suprathreshold levels, particularly in challenging listening environments. The primary aim of this study was to investigate the effect of noise exposure on auditory processing, including the perception of speech in noise, in adult humans. A secondary aim was to explore whether musical training might improve some aspects of auditory processing and thus counteract or ameliorate any negative impacts of noise exposure. In a sample of 122 participants (63 female) aged 30-57 years with normal or near-normal hearing thresholds, we conducted audiometric tests, including tympanometry, audiometry, acoustic reflexes, otoacoustic emissions and medial olivocochlear responses. We also assessed temporal and spectral processing, by determining thresholds for detection of amplitude modulation and temporal fine structure. We assessed speech-in-noise perception, and conducted tests of attention, memory and sentence closure. We also calculated participants' accumulated lifetime noise exposure and administered questionnaires to assess self-reported listening difficulty and musical training. The results showed no clear link between participants' lifetime noise exposure and performance on any of the auditory processing or speech-in-noise tasks. Musical training was associated with better performance on the auditory processing tasks, but not the on the speech-in-noise perception tasks. The results indicate that sentence closure skills, working memory, attention, extended high frequency hearing thresholds and medial olivocochlear suppression strength are important factors that are related to the ability to process speech in noise. Crown Copyright © 2017. Published by
Anderson, Karen L; Goldstein, Howard
Children typically learn in classroom environments that have background noise and reverberation that interfere with accurate speech perception. Amplification technology can enhance the speech perception of students who are hard of hearing. This study used a single-subject alternating treatments design to compare the speech recognition abilities of children who are, hard of hearing when they were using hearing aids with each of three frequency modulated (FM) or infrared devices. Eight 9-12-year-olds with mild to severe hearing loss repeated Hearing in Noise Test (HINT) sentence lists under controlled conditions in a typical kindergarten classroom with a background noise level of +10 dB signal-to-noise (S/N) ratio and 1.1 s reverberation time. Participants listened to HINT lists using hearing aids alone and hearing aids in combination with three types of S/N-enhancing devices that are currently used in mainstream classrooms: (a) FM systems linked to personal hearing aids, (b) infrared sound field systems with speakers placed throughout the classroom, and (c) desktop personal sound field FM systems. The infrared ceiling sound field system did not provide benefit beyond that provided by hearing aids alone. Desktop and personal FM systems in combination with personal hearing aids provided substantial improvements in speech recognition. This information can assist in making S/N-enhancing device decisions for students using hearing aids. In a reverberant and noisy classroom setting, classroom sound field devices are not beneficial to speech perception for students with hearing aids, whereas either personal FM or desktop sound field systems provide listening benefits.
Buchan, Julie N; Munhall, Kevin G
Audiovisual speech perception is an everyday occurrence of multisensory integration. Conflicting visual speech information can influence the perception of acoustic speech (namely the McGurk effect), and auditory and visual speech are integrated over a rather wide range of temporal offsets. This research examined whether the addition of a concurrent cognitive load task would affect the audiovisual integration in a McGurk speech task and whether the cognitive load task would cause more interference at increasing offsets. The amount of integration was measured by the proportion of responses in incongruent trials that did not correspond to the audio (McGurk response). An eye-tracker was also used to examine whether the amount of temporal offset and the presence of a concurrent cognitive load task would influence gaze behavior. Results from this experiment show a very modest but statistically significant decrease in the number of McGurk responses when subjects also perform a cognitive load task, and that this effect is relatively constant across the various temporal offsets. Participant's gaze behavior was also influenced by the addition of a cognitive load task. Gaze was less centralized on the face, less time was spent looking at the mouth and more time was spent looking at the eyes, when a concurrent cognitive load task was added to the speech task.
Taitelbaum-Swead, Riki; Fostick, Leah
Everyday life includes fluctuating noise levels, resulting in continuously changing speech intelligibility. The study aims were: (1) to quantify the amount of decrease in age-related speech perception, as a result of increasing noise level, and (2) to test the effect of age on context usage at the word level (smaller amount of contextual cues). A total of 24 young adults (age 20-30 years) and 20 older adults (age 60-75 years) were tested. Meaningful and nonsense one-syllable consonant-vowel-consonant words were presented with the background noise types of speech noise (SpN), babble noise (BN), and white noise (WN), with a signal-to-noise ratio (SNR) of 0 and -5 dB. Older adults had lower accuracy in SNR = 0, with WN being the most difficult condition for all participants. Measuring the change in speech perception when SNR decreased showed a reduction of 18.6-61.5% in intelligibility, with age effect only for BN. Both young and older adults used less phonemic context with WN, as compared to other conditions. Older adults are more affected by an increasing noise level of fluctuating informational noise as compared to steady-state noise. They also use less contextual cues when perceiving monosyllabic words. Further studies should take into consideration that when presenting the stimulus differently (change in noise level, less contextual cues), other perceptual and cognitive processes are involved. © 2016 S. Karger AG, Basel.
Gifford, René H; Revit, Lawrence J
Although cochlear implant patients are achieving increasingly higher levels of performance, speech perception in noise continues to be problematic. The newest generations of implant speech processors are equipped with preprocessing and/or external accessories that are purported to improve listening in noise. Most speech perception measures in the clinical setting, however, do not provide a close approximation to real-world listening environments. To assess speech perception for adult cochlear implant recipients in the presence of a realistic restaurant simulation generated by an eight-loudspeaker (R-SPACE) array in order to determine whether commercially available preprocessing strategies and/or external accessories yield improved sentence recognition in noise. Single-subject, repeated-measures design with two groups of participants: Advanced Bionics and Cochlear Corporation recipients. Thirty-four subjects, ranging in age from 18 to 90 yr (mean 54.5 yr), participated in this prospective study. Fourteen subjects were Advanced Bionics recipients, and 20 subjects were Cochlear Corporation recipients. Speech reception thresholds (SRTs) in semidiffuse restaurant noise originating from an eight-loudspeaker array were assessed with the subjects' preferred listening programs as well as with the addition of either Beam preprocessing (Cochlear Corporation) or the T-Mic accessory option (Advanced Bionics). In Experiment 1, adaptive SRTs with the Hearing in Noise Test sentences were obtained for all 34 subjects. For Cochlear Corporation recipients, SRTs were obtained with their preferred everyday listening program as well as with the addition of Focus preprocessing. For Advanced Bionics recipients, SRTs were obtained with the integrated behind-the-ear (BTE) mic as well as with the T-Mic. Statistical analysis using a repeated-measures analysis of variance (ANOVA) evaluated the effects of the preprocessing strategy or external accessory in reducing the SRT in noise. In addition
Full Text Available Abstract Background How does the brain repair obliterated speech and cope with acoustically ambivalent situations? A widely discussed possibility is to use top-down information for solving the ambiguity problem. In the case of speech, this may lead to a match of bottom-up sensory input with lexical expectations resulting in resonant states which are reflected in the induced gamma-band activity (GBA. Methods In the present EEG study, we compared the subject's pre-attentive GBA responses to obliterated speech segments presented after a series of correct words. The words were a minimal pair in German and differed with respect to the degree of specificity of segmental phonological information. Results The induced GBA was larger when the expected lexical information was phonologically fully specified compared to the underspecified condition. Thus, the degree of specificity of phonological information in the mental lexicon correlates with the intensity of the matching process of bottom-up sensory input with lexical information. Conclusions These results together with those of a behavioural control experiment support the notion of multi-level mechanisms involved in the repair of deficient speech. The delineated alignment of pre-existing knowledge with sensory input is in accordance with recent ideas about the role of internal forward models in speech perception.
Alexander L. Francis
Full Text Available Typically, understanding speech seems effortless and automatic. However, a variety of factors may, independently or interactively, make listening more effortful. Physiological measures may help to distinguish between the application of different cognitive mechanisms whose operation is perceived as effortful. In the present study, physiological and behavioral measures associated with task demand were collected along with behavioral measures of performance while participants listened to and repeated sentences. The goal was to measure psychophysiological reactivity associated with three degraded listening conditions, each of which differed in terms of the source of the difficulty (distortion, energetic masking, and informational masking, and therefore were expected to engage different cognitive mechanisms. These conditions were chosen to be matched for overall performance (keywords correct, and were compared to listening to unmasked speech produced by a natural voice. The three degraded conditions were: (1 Unmasked speech produced by a computer speech synthesizer, (2 Speech produced by a natural voice and masked by speech-shaped noise and (3 Speech produced by a natural voice and masked by two-talker babble. Masked conditions were both presented at a -8 dB signal to noise ratio (SNR, a level shown in previous research to result in comparable levels of performance for these stimuli and maskers. Performance was measured in terms of proportion of key words identified correctly, and task demand or effort was quantified subjectively by self-report. Measures of psychophysiological reactivity included electrodermal (skin conductance response frequency and amplitude, blood pulse amplitude and pulse rate. Results suggest that the two masked conditions evoked stronger psychophysiological reactivity than did the two unmasked conditions even when behavioral measures of listening performance and listeners’ subjective perception of task demand were comparable
Roman, Adrienne S; Pisoni, David B; Kronenberger, William G; Faulkner, Kathleen F
Noise-vocoded speech is a valuable research tool for testing experimental hypotheses about the effects of spectral degradation on speech recognition in adults with normal hearing (NH). However, very little research has utilized noise-vocoded speech with children with NH. Earlier studies with children with NH focused primarily on the amount of spectral information needed for speech recognition without assessing the contribution of neurocognitive processes to speech perception and spoken word recognition. In this study, we first replicated the seminal findings reported by ) who investigated effects of lexical density and word frequency on noise-vocoded speech perception in a small group of children with NH. We then extended the research to investigate relations between noise-vocoded speech recognition abilities and five neurocognitive measures: auditory attention (AA) and response set, talker discrimination, and verbal and nonverbal short-term working memory. Thirty-one children with NH between 5 and 13 years of age were assessed on their ability to perceive lexically controlled words in isolation and in sentences that were noise-vocoded to four spectral channels. Children were also administered vocabulary assessments (Peabody Picture Vocabulary test-4th Edition and Expressive Vocabulary test-2nd Edition) and measures of AA (NEPSY AA and response set and a talker discrimination task) and short-term memory (visual digit and symbol spans). Consistent with the findings reported in the original ) study, we found that children perceived noise-vocoded lexically easy words better than lexically hard words. Words in sentences were also recognized better than the same words presented in isolation. No significant correlations were observed between noise-vocoded speech recognition scores and the Peabody Picture Vocabulary test-4th Edition using language quotients to control for age effects. However, children who scored higher on the Expressive Vocabulary test-2nd Edition
Pamplona, María Del Carmen; Ysunza, Pablo Antonio; Morales, Santiago
Children with cleft palate frequently show speech disorders known as compensatory articulation. Compensatory articulation requires a prolonged period of speech intervention that should include reinforcement at home. However, frequently relatives do not know how to work with their children at home. To study whether the use of audiovisual materials especially designed for complementing speech pathology treatment in children with compensatory articulation can be effective for stimulating articulation practice at home and consequently enhancing speech normalization in children with cleft palate. Eighty-two patients with compensatory articulation were studied. Patients were randomly divided into two groups. Both groups received speech pathology treatment aimed to correct articulation placement. In addition, patients from the active group received a set of audiovisual materials to be used at home. Parents were instructed about strategies and ideas about how to use the materials with their children. Severity of compensatory articulation was compared at the onset and at the end of the speech intervention. After the speech therapy period, the group of patients using audiovisual materials at home demonstrated significantly greater improvement in articulation, as compared with the patients receiving speech pathology treatment on - site without audiovisual supporting materials. The results of this study suggest that audiovisual materials especially designed for practicing adequate articulation placement at home can be effective for reinforcing and enhancing speech pathology treatment of patients with cleft palate and compensatory articulation. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Blood, Gordon W; Blood, Ingrid M; Coniglio, Amy D; Finke, Erinn H; Boyle, Michael P
Children with autism spectrum disorders (ASD) are primary targets for bullies and victimization. Research shows school personnel may be uneducated about bullying and ways to intervene. Speech-language pathologists (SLPs) in schools often work with children with ASD and may have victims of bullying on their caseloads. These victims may feel most comfortable turning to SLPs for help during one-to-one treatment sessions to discuss these types of experiences. A nationwide survey mailed to 1000 school-based SLPs, using a vignette design technique, determined perceptions about intervention for bullying and use of specific strategies. Results revealed a majority of the SLPs (89%) responses were in "likely" or "very likely" to intervene categories for all types of bullying (physical, verbal, relational and cyber), regardless of whether the episode was observed or not. A factor analysis was conducted on a 14 item strategy scale for dealing with bullying for children with ASD. Three factors emerged, labeled "Report/Consult", "Educate the Victim", and Reassure the Victim". SLPs providing no services to children with ASD on their caseloads demonstrated significantly lower mean scores for the likelihood of intervention and using select strategies. SLPs may play an important role in reducing and/or eliminating bullying episodes in children with ASD. Readers will be able to (a) explain four different types of bullying, (b) describe the important role of school personnel in reducing and eliminating bullying, (c) describe the perceptions and strategies selected by SLPs to deal with bullying episodes for students with ASD, and (d) outline the potential role of SLPs in assisting students with ASD who are victimized. Copyright © 2013 Elsevier Inc. All rights reserved.
Müürsepp, Iti; Aibast, Herje; Gapeyeva, Helena; Pääsuke, Mati
The aim of the study was to evaluate motor skills, haptic object recognition and social interaction in 5-year-old children with mild specific expressive language impairment (expressive-SLI) and articulation disorder (AD) in comparison of age- and gender matched healthy children. Twenty nine children (23 boys and 6 girls) with expressive-SLI, 27 children (20 boys and 7 girls) with AD and 30 children (23 boys and 7 girls) with typically developing language as controls participated in our study. The children were examined for manual dexterity, ball skills, static and dynamic balance by M-ABC test, haptic object recognition and for social interaction by questionnaire completed by teachers. Children with mild expressive-SLI demonstrated significantly poorer results in all subtests of motor skills (psocial interaction (p0.05) in measured parameters between children with AD and controls. Children with expressive-SLI performed considerably poorer compared to AD group in balance subtest (psocial interaction are considerably more affected than in children with AD. Although motor difficulties in speech production are prevalent in AD, it is localised and does not involve children's general motor skills, haptic perception or social interaction. Copyright © 2011 The Japanese Society of Child Neurology. Published by Elsevier B.V. All rights reserved.
Ramirez, Joshua; Mann, Virginia
Both dyslexics and auditory neuropathy (AN) subjects show inferior consonant-vowel (CV) perception in noise, relative to controls. To better understand these impairments, natural acoustic speech stimuli that were masked in speech-shaped noise at various intensities were presented to dyslexic, AN, and control subjects either in isolation or accompanied by visual articulatory cues. AN subjects were expected to benefit from the pairing of visual articulatory cues and auditory CV stimuli, provided that their speech perception impairment reflects a relatively peripheral auditory disorder. Assuming that dyslexia reflects a general impairment of speech processing rather than a disorder of audition, dyslexics were not expected to similarly benefit from an introduction of visual articulatory cues. The results revealed an increased effect of noise masking on the perception of isolated acoustic stimuli by both dyslexic and AN subjects. More importantly, dyslexics showed less effective use of visual articulatory cues in identifying masked speech stimuli and lower visual baseline performance relative to AN subjects and controls. Last, a significant positive correlation was found between reading ability and the ameliorating effect of visual articulatory cues on speech perception in noise. These results suggest that some reading impairments may stem from a central deficit of speech processing.
Booth, Alannah; Choto, Fadziso; Gotlieb, Jessica; Robertson, Rebecca; Morris, Gabriella; Stockley, Nicola; Mauff, Katya
Background Upon graduation, newly qualified speech-language therapists are expected to provide services independently. This study describes new graduates’ perceptions of their preparedness to provide services across the scope of the profession and explores associations between perceptions of dysphagia theory and clinical learning curricula with preparedness for adult and paediatric dysphagia service delivery. Methods New graduates of six South African universities were recruited to participate in a survey by completing an electronic questionnaire exploring their perceptions of the dysphagia curricula and their preparedness to practise across the scope of the profession of speech-language therapy. Results Eighty graduates participated in the study yielding a response rate of 63.49%. Participants perceived themselves to be well prepared in some areas (e.g. child language: 100%; articulation and phonology: 97.26%), but less prepared in other areas (e.g. adult dysphagia: 50.70%; paediatric dysarthria: 46.58%; paediatric dysphagia: 38.36%) and most unprepared to provide services requiring sign language (23.61%) and African languages (20.55%). There was a significant relationship between perceptions of adequate theory and clinical learning opportunities with assessment and management of dysphagia and perceptions of preparedness to provide dysphagia services. Conclusion There is a need for review of existing curricula and consideration of developing a standard speech-language therapy curriculum across universities, particularly in service provision to a multilingual population, and in both the theory and clinical learning of the assessment and management of adult and paediatric dysphagia, to better equip graduates for practice. PMID:26304217
Alm, Magnus; Behne, Dawn M; Wang, Yue; Eg, Ragnhild
Research shows that noise and phonetic attributes influence the degree to which auditory and visual modalities are used in audio-visual speech perception (AVSP). Research has, however, mainly focused on white noise and single phonetic attributes, thus neglecting the more common babble noise and possible interactions between phonetic attributes. This study explores whether white and babble noise differentially influence AVSP and whether these differences depend on phonetic attributes. White and babble noise of 0 and -12 dB signal-to-noise ratio were added to congruent and incongruent audio-visual stop consonant-vowel stimuli. The audio (A) and video (V) of incongruent stimuli differed either in place of articulation (POA) or voicing. Responses from 15 young adults show that, compared to white noise, babble resulted in more audio responses for POA stimuli, and fewer for voicing stimuli. Voiced syllables received more audio responses than voiceless syllables. Results can be attributed to discrepancies in the acoustic spectra of both the noise and speech target. Voiced consonants may be more auditorily salient than voiceless consonants which are more spectrally similar to white noise. Visual cues contribute to identification of voicing, but only if the POA is visually salient and auditorily susceptible to the noise type.
Lauren J. Amick
Full Text Available Stuttering is a neurodevelopmental disorder characterized by frequent and involuntary disruptions during speech production. Adults who stutter are often subject to negative perceptions. The present study examined whether negative social and cognitive impressions are formed when listening to speech, even without any knowledge about the speaker. Two experiments were conducted in which naïve participants were asked to listen to and provide ratings on samples of read speech produced by adults who stutter and typically-speaking adults without knowledge about the individuals who produced the speech. In both experiments, listeners rated speaker cognitive ability, likeability, anxiety, as well as a number of speech characteristics that included fluency, naturalness, intelligibility, the likelihood the speaker had a speech-and-language disorder (Experiment 1 only, rate and volume (both Experiments 1 and 2. The speech of adults who stutter was perceived to be less fluent, natural, intelligible, and to be slower and louder than the speech of typical adults. Adults who stutter were also perceived to have lower cognitive ability, to be less likeable and to be more anxious than the typical adult speakers. Relations between speech characteristics and social and cognitive impressions were found, independent of whether or not the speaker stuttered (i.e., they were found for both adults who stutter and typically-speaking adults and did not depend on being cued that some of the speakers may have had a speech-language impairment.
Cho, Soojin; Yu, Jyaehyoung; Chun, Hyungi; Seo, Hyekyung; Han, Woojae
Deficits of the aging auditory system negatively affect older listeners in terms of speech communication, resulting in limitations to their social lives. To improve their perceptual skills, the goal of this study was to investigate the effects of time alteration, selective word stress, and varying sentence lengths on the speech perception of older listeners. Seventeen older people with normal hearing were tested for seven conditions of different time-altered sentences (i.e., ±60%, ±40%, ±20%, 0%), two conditions of selective word stress (i.e., no-stress and stress), and three different lengths of sentences (i.e., short, medium, and long) at the most comfortable level for individuals in quiet circumstances. As time compression increased, sentence perception scores decreased statistically. Compared to a natural (or no stress) condition, the selectively stressed words significantly improved the perceptual scores of these older listeners. Long sentences yielded the worst scores under all time-altered conditions. Interestingly, there was a noticeable positive effect for the selective word stress at the 20% time compression. This pattern of results suggests that a combination of time compression and selective word stress is more effective for understanding speech in older listeners than using the time-expanded condition only.
Hamza, Yasmeen; Okalidou, Areti; Kyriafinis, George; van Wieringen, Astrid
Sonority is the relative perceptual prominence/loudness of speech sounds of the same length, stress, and pitch. Children with cochlear implants (CIs), with restored audibility and relatively intact temporal processing, are expected to benefit from the perceptual prominence cues of highly sonorous sounds. Sonority also influences lexical access through the sonority-sequencing principle (SSP), a grammatical phonotactic rule, which facilitates the recognition and segmentation of syllables within speech. The more nonsonorous the onset of a syllable is, the larger is the degree of sonority rise to the nucleus, and the more optimal the SSP. Children with CIs may experience hindered or delayed development of the language-learning rule SSP, as a result of their deprived/degraded auditory experience. The purpose of the study was to explore sonority's role in speech perception and lexical access of prelingually deafened children with CIs. A case-control study with 15 children with CIs, 25 normal-hearing children (NHC), and 50 normal-hearing adults was conducted, using a lexical identification task of novel, nonreal CV-CV words taught via fast mapping. The CV-CV words were constructed according to four sonority conditions, entailing syllables with sonorous onsets/less optimal SSP (SS) and nonsonorous onsets/optimal SSP (NS) in all combinations, that is, SS-SS, SS-NS, NS-SS, and NS-NS. Outcome measures were accuracy and reaction times (RTs). A subgroup analysis of 12 children with CIs pair matched to 12 NHC on hearing age aimed to study the effect of oral-language exposure period on the sonority-related performance. The children groups showed similar accuracy performance, overall and across all the sonority conditions. However, within-group comparisons showed that the children with CIs scored more accurately on the SS-SS condition relative to the NS-NS and NS-SS conditions, while the NHC performed equally well across all conditions. Additionally, adult-comparable accuracy
Stropahl, Maren; Debener, Stefan
There is clear evidence for cross-modal cortical reorganization in the auditory system of post-lingually deafened cochlear implant (CI) users. A recent report suggests that moderate sensori-neural hearing loss is already sufficient to initiate corresponding cortical changes. To what extend these changes are deprivation-induced or related to sensory recovery is still debated. Moreover, the influence of cross-modal reorganization on CI benefit is also still unclear. While reorganization during deafness may impede speech recovery, reorganization also has beneficial influences on face recognition and lip-reading. As CI users were observed to show differences in multisensory integration, the question arises if cross-modal reorganization is related to audio-visual integration skills. The current electroencephalography study investigated cortical reorganization in experienced post-lingually deafened CI users ( n = 18), untreated mild to moderately hearing impaired individuals (n = 18) and normal hearing controls ( n = 17). Cross-modal activation of the auditory cortex by means of EEG source localization in response to human faces and audio-visual integration, quantified with the McGurk illusion, were measured. CI users revealed stronger cross-modal activations compared to age-matched normal hearing individuals. Furthermore, CI users showed a relationship between cross-modal activation and audio-visual integration strength. This may further support a beneficial relationship between cross-modal activation and daily-life communication skills that may not be fully captured by laboratory-based speech perception tests. Interestingly, hearing impaired individuals showed behavioral and neurophysiological results that were numerically between the other two groups, and they showed a moderate relationship between cross-modal activation and the degree of hearing loss. This further supports the notion that auditory deprivation evokes a reorganization of the auditory system
Full Text Available There is clear evidence for cross-modal cortical reorganization in the auditory system of post-lingually deafened cochlear implant (CI users. A recent report suggests that moderate sensori-neural hearing loss is already sufficient to initiate corresponding cortical changes. To what extend these changes are deprivation-induced or related to sensory recovery is still debated. Moreover, the influence of cross-modal reorganization on CI benefit is also still unclear. While reorganization during deafness may impede speech recovery, reorganization also has beneficial influences on face recognition and lip-reading. As CI users were observed to show differences in multisensory integration, the question arises if cross-modal reorganization is related to audio-visual integration skills. The current electroencephalography study investigated cortical reorganization in experienced post-lingually deafened CI users (n = 18, untreated mild to moderately hearing impaired individuals (n = 18 and normal hearing controls (n = 17. Cross-modal activation of the auditory cortex by means of EEG source localization in response to human faces and audio-visual integration, quantified with the McGurk illusion, were measured. CI users revealed stronger cross-modal activations compared to age-matched normal hearing individuals. Furthermore, CI users showed a relationship between cross-modal activation and audio-visual integration strength. This may further support a beneficial relationship between cross-modal activation and daily-life communication skills that may not be fully captured by laboratory-based speech perception tests. Interestingly, hearing impaired individuals showed behavioral and neurophysiological results that were numerically between the other two groups, and they showed a moderate relationship between cross-modal activation and the degree of hearing loss. This further supports the notion that auditory deprivation evokes a reorganization of the
Chan, Yu Man; Pianta, Michael J; McKendrick, Allison M
Perceived synchrony of visual and auditory signals can be altered by exposure to a stream of temporally offset stimulus pairs. Previous literature suggests that adapting to audiovisual temporal offsets is an important recalibration to correctly combine audiovisual stimuli into a single percept across a range of source distances. Healthy aging results in synchrony perception over a wider range of temporally offset visual and auditory signals, independent of age-related unisensory declines in vision and hearing sensitivities. However, the impact of aging on audiovisual recalibration is unknown. Audiovisual synchrony perception for sound-lead and sound-lag stimuli was measured for 15 younger (22-32 years old) and 15 older (64-74 years old) healthy adults using a method-of-constant-stimuli, after adapting to a stream of visual and auditory pairs. The adaptation pairs were either synchronous or asynchronous (sound-lag of 230 ms). The adaptation effect for each observer was computed as the shift in the mean of the individually fitted psychometric functions after adapting to asynchrony. Post-adaptation to synchrony, the younger and older observers had average window widths (±standard deviation) of 326 (±80) and 448 (±105) ms, respectively. There was no adaptation effect for sound-lead pairs. Both the younger and older observers, however, perceived more sound-lag pairs as synchronous. The magnitude of the adaptation effect in the older observers was not correlated with how often they saw the adapting sound-lag stimuli as asynchronous. Our finding demonstrates that audiovisual synchrony perception adapts less with advancing age.
Noel, Jean-Paul; De Niear, Matthew; Van der Burg, Erik; Wallace, Mark T
Multisensory interactions are well established to convey an array of perceptual and behavioral benefits. One of the key features of multisensory interactions is the temporal structure of the stimuli combined. In an effort to better characterize how temporal factors influence multisensory interactions across the lifespan, we examined audiovisual simultaneity judgment and the degree of rapid recalibration to paired audiovisual stimuli (Flash-Beep and Speech) in a sample of 220 participants ranging from 7 to 86 years of age. Results demonstrate a surprisingly protracted developmental time-course for both audiovisual simultaneity judgment and rapid recalibration, with neither reaching maturity until well into adolescence. Interestingly, correlational analyses revealed that audiovisual simultaneity judgments (i.e., the size of the audiovisual temporal window of simultaneity) and rapid recalibration significantly co-varied as a function of age. Together, our results represent the most complete description of age-related changes in audiovisual simultaneity judgments to date, as well as being the first to describe changes in the degree of rapid recalibration as a function of age. We propose that the developmental time-course of rapid recalibration scaffolds the maturation of more durable audiovisual temporal representations.
Chen, Junwen; McLean, Jordan E; Kemps, Eva
This study investigated the effects of combined audience feedback with video feedback plus cognitive preparation, and cognitive review (enabling deeper processing of feedback) on state anxiety and self-perceptions including perception of performance and perceived probability of negative evaluation in socially anxious individuals during a speech performance. One hundred and forty socially anxious students were randomly assigned to four conditions: Cognitive Preparation + Video Feedback + Audience Feedback + Cognitive Review (CP+VF+AF+CR), Cognitive Preparation + Video Feedback + Cognitive Review (CP+VF+CR), Cognitive Preparation + Video Feedback only (CP+VF), and Control. They were asked to deliver two impromptu speeches that were evaluated by confederates. Participants' levels of anxiety and self-perceptions pertaining to the speech task were assessed before and after feedback, and after the second speech. Compared to participants in the other conditions, participants in the CP+VF+AF+CR condition reported a significant decrease in their state anxiety and perceived probability of negative evaluation scores, and a significant increase in their positive perception of speech performance from before to after the feedback. These effects generalized to the second speech. Our results suggest that adding audience feedback to video feedback plus cognitive preparation and cognitive review may improve the effects of existing video feedback procedures in reducing anxiety symptoms and distorted self-representations in socially anxious individuals. Copyright © 2017. Published by Elsevier Ltd.
Casini, Laurence; Burle, Boris; Nguyen, Noel
Time is essential to speech. The duration of speech segments plays a critical role in the perceptual identification of these segments, and therefore in that of spoken words. Here, using a French word identification task, we show that vowels are perceived as shorter when attention is divided between two tasks, as compared to a single task control…
Rosner, Burton S.; Talcott, Joel B.; Witton, Caroline; Hogg, James D.; Richardson, Alexandra J.; Hansen, Peter C.; Stein, John F.
"Sine-wave speech" sentences contain only four frequency-modulated sine waves, lacking many acoustic cues present in natural speech. Adults with (n=19) and without (n=14) dyslexia were asked to reproduce orally sine-wave utterances in successive trials. Results suggest comprehension of sine-wave sentences is impaired in some adults with…
Greenwood, Nan; Wright, Jannet A.; Bithell, Christine
Background: Communication disorders affect both sexes and people from all ethnic groups, but members of minority ethnic groups and males in the UK are underrepresented in the speech and language therapy profession. Research in the area of recruitment is limited, but a possible explanation is poor awareness and understanding of speech and language…
Jaekel, Brittany N.; Newman, Rochelle S.; Goupell, Matthew J.
Purpose: Normal-hearing (NH) listeners rate normalize, temporarily remapping phonemic category boundaries to account for a talker's speech rate. It is unknown if adults who use auditory prostheses called cochlear implants (CI) can rate normalize, as CIs transmit degraded speech signals to the auditory nerve. Ineffective adjustment to rate…
Meister, Hartmut; Schreitmüller, Stefan; Grugel, Linda; Beutner, Dirk; Walger, Martin; Meister, Ingo
The purpose of this study was to investigate the relationship of cognitive functions (i.e., working memory [WM]) and speech recognition against different background maskers in older individuals. Speech reception thresholds (SRTs) were determined using a matrix-sentence test. Unmodulated noise, modulated noise (International Collegium for Rehabilitative Audiology [ICRA] noise 5-250), and speech fragments (International Speech Test Signal [ISTS]) were used as background maskers. Verbal WM was assessed using the Verbal Learning and Memory Test (VLMT; Helmstaedter & Durwen, 1990). Measurements were conducted with 14 normal-hearing older individuals and a control group of 12 normal-hearing young listeners. Despite their normal hearing ability, the young listeners outperformed the older individuals in all background maskers. These differences were largest for the modulated maskers. SRTs were significantly correlated with the scores of the VLMT. A linear regression model also included WM as the only significant predictor variable. The results support the assumption that WM plays an important role for speech understanding and that it might have impact on results obtained using speech audiometry. Thus, an individual's WM capacity should be considered with aural diagnosis and rehabilitation. The VLMT proved to be a clinically applicable test for WM. Further cognitive functions important with speech understanding are currently being investigated within the SAKoLA (Sprachaudiometrie und kognitive Leistungen im Alter [Speech Audiometry and Cognitive Functions in the Elderly]) project.
Geers, Ann E; Davidson, Lisa S; Uchanski, Rosalie M; Nicholas, Johanna G
This study documented the ability of experienced pediatric cochlear implant (CI) users to perceive linguistic properties (what is said) and indexical attributes (emotional intent and talker identity) of speech, and examined the extent to which linguistic (LSP) and indexical (ISP) perception skills are related. Preimplant-aided hearing, age at implantation, speech processor technology, CI-aided thresholds, sequential bilateral cochlear implantation, and academic integration with hearing age-mates were examined for their possible relationships to both LSP and ISP skills. Sixty 9- to 12-year olds, first implanted at an early age (12 to 38 months), participated in a comprehensive test battery that included the following LSP skills: (1) recognition of monosyllabic words at loud and soft levels, (2) repetition of phonemes and suprasegmental features from nonwords, and (3) recognition of key words from sentences presented within a noise background, and the following ISP skills: (1) discrimination of across-gender and within-gender (female) talkers and (2) identification and discrimination of emotional content from spoken sentences. A group of 30 age-matched children without hearing loss completed the nonword repetition, and talker- and emotion-perception tasks for comparison. Word-recognition scores decreased with signal level from a mean of 77% correct at 70 dB SPL to 52% at 50 dB SPL. On average, CI users recognized 50% of key words presented in sentences that were 9.8 dB above background noise. Phonetic properties were repeated from nonword stimuli at about the same level of accuracy as suprasegmental attributes (70 and 75%, respectively). The majority of CI users identified emotional content and differentiated talkers significantly above chance levels. Scores on LSP and ISP measures were combined into separate principal component scores and these components were highly correlated (r = 0.76). Both LSP and ISP component scores were higher for children who received a CI
Full Text Available We examined the cross-modal effect of irrelevant sound (or disk on the perceived visual (or auditory duration, and how visual and auditory signals are integrated when perceiving the duration. Participants conducted a duration discrimination task with a 2-Interval-Forced-Choice procedure, with one interval containing the standard duration and the other the comparison duration. In study 1, the standard and comparison durations were either in the same modality or with another modality added. The point-of-subjective-equality and threshold were measured from the psychometric functions. Results showed that sound expanded the perceived visual duration at the intermediate durations but there was no effect of disk on the perceived auditory duration. In study 2, bimodal signals were used in both the standard and comparison durations and the Maximum-Likelihood-Estimation (MLE model was used to predict bimodal performance from the observed unimodal results. The contribution of auditory signals to the bimodal estimate of duration was greater than that predicted by the MLE model, and so was the contribution of visual signals when these signals were temporally informative (ie, looming disks. We propose a hybrid model that considers both the prior bias for auditory signal and the reliability of both auditory and visual signals to explain the results.
Melissa M Baese-Berk
Full Text Available Neil Armstrong insisted that his quote upon landing on the moon was misheard, and that he had said one small step for a man, instead of one small step for man. What he said is unclear in part because function words like a can be reduced and spectrally indistinguishable from the preceding context. Therefore, their presence can be ambiguous, and they may disappear perceptually depending on the rate of surrounding speech. Two experiments are presented examining production and perception of reduced tokens of for and for a in spontaneous speech. Experiment 1 investigates the distributions of several acoustic features of for and for a. The results suggest that the distributions of for and for a overlap substantially, both in terms of temporal and spectral characteristics. Experiment 2 examines perception of these same tokens when the context speaking rate differs. The perceptibility of the function word a varies as a function of this context speaking rate. These results demonstrate that substantial ambiguity exists in the original quote from Armstrong, and that this ambiguity may be understood through context speaking rate.
Baese-Berk, Melissa M; Dilley, Laura C; Schmidt, Stephanie; Morrill, Tuuli H; Pitt, Mark A
Neil Armstrong insisted that his quote upon landing on the moon was misheard, and that he had said one small step for a man, instead of one small step for man. What he said is unclear in part because function words like a can be reduced and spectrally indistinguishable from the preceding context. Therefore, their presence can be ambiguous, and they may disappear perceptually depending on the rate of surrounding speech. Two experiments are presented examining production and perception of reduced tokens of for and for a in spontaneous speech. Experiment 1 investigates the distributions of several acoustic features of for and for a. The results suggest that the distributions of for and for a overlap substantially, both in terms of temporal and spectral characteristics. Experiment 2 examines perception of these same tokens when the context speaking rate differs. The perceptibility of the function word a varies as a function of this context speaking rate. These results demonstrate that substantial ambiguity exists in the original quote from Armstrong, and that this ambiguity may be understood through context speaking rate.
Burnham, Denis; Dodd, Barbara
The McGurk effect, in which auditory [ba] dubbed onto [ga] lip movements is perceived as "da" or "tha," was employed in a real-time task to investigate auditory-visual speech perception in prelingual infants. Experiments 1A and 1B established the validity of real-time dubbing for producing the effect. In Experiment 2, 4 1/2-month-olds were tested in a habituation-test paradigm, in which an auditory-visual stimulus was presented contingent upon visual fixation of a live face. The experimental group was habituated to a McGurk stimulus (auditory [ba] visual [ga]), and the control group to matching auditory-visual [ba]. Each group was then presented with three auditory-only test trials, [ba], [da], and [(delta)a] (as in then). Visual-fixation durations in test trials showed that the experimental group treated the emergent percept in the McGurk effect, [da] or [(delta)a], as familiar (even though they had not heard these sounds previously) and [ba] as novel. For control group infants [da] and [(delta)a] were no more familiar than [ba]. These results are consistent with infants' perception of the McGurk effect, and support the conclusion that prelinguistic infants integrate auditory and visual speech information. Copyright 2004 Wiley Periodicals, Inc.
Today, huge quantities of digital audiovisual resources are already available - everywhere and at any time - through Web portals, online archives and libraries, and video blogs. One central question with respect to this huge amount of audiovisual data is how they can be used in specific (social, pedagogical, etc.) contexts and what are their potential interest for target groups (communities, professionals, students, researchers, etc.).This book examines the question of the (creative) exploitation of digital audiovisual archives from a theoretical, methodological, technical and practical
Narayan, Chandan R; Werker, Janet F; Beddor, Patrice Speeter
Previous research suggests that infant speech perception reorganizes in the first year: young infants discriminate both native and non-native phonetic contrasts, but by 10-12 months difficult non-native contrasts are less discriminable whereas performance improves on native contrasts. In the current study, four experiments tested the hypothesis that, in addition to the influence of native language experience, acoustic salience also affects the perceptual reorganization that takes place in infancy. Using a visual habituation paradigm, two nasal place distinctions that differ in relative acoustic salience, acoustically robust labial-alveolar [ma]-[na] and acoustically less salient alveolar-velar [na]-[ enga], were presented to infants in a cross-language design. English-learning infants at 6-8 and 10-12 months showed discrimination of the native and acoustically robust [ma]-[na] (Experiment 1), but not the non-native (in initial position) and acoustically less salient [na]-[ enga] (Experiment 2). Very young (4-5-month-old) English-learning infants tested on the same native and non-native contrasts also showed discrimination of only the [ma]-[na] distinction (Experiment 3). Filipino-learning infants, whose ambient language includes the syllable-initial alveolar (/n/)-velar (/ eng/) contrast, showed discrimination of native [na]-[ enga] at 10-12 months, but not at 6-8 months (Experiment 4). These results support the hypothesis that acoustic salience affects speech perception in infancy, with native language experience facilitating discrimination of an acoustically similar phonetic distinction [na]-[ enga]. We discuss the implications of this developmental profile for a comprehensive theory of speech perception in infancy.
Távora-Vieira, Dayse; Marino, Roberta; Acharya, Aanand; Rajan, Gunesh P
This study aimed to determine the impact of cochlear implantation on speech understanding in noise, subjective perception of hearing, and tinnitus perception of adult patients with unilateral severe to profound hearing loss and to investigate whether duration of deafness and age at implantation would influence the outcomes. In addition, this article describes the auditory training protocol used for unilaterally deaf patients. This is a prospective study of subjects undergoing cochlear implantation for unilateral deafness with or without associated tinnitus. Speech perception in noise was tested using the Bamford-Kowal-Bench speech-in-noise test presented at 65 dB SPL. The Speech, Spatial, and Qualities of Hearing Scale and the Abbreviated Profile of Hearing Aid Benefit were used to evaluate the subjective perception of hearing with a cochlear implant and quality of life. Tinnitus disturbance was measured using the Tinnitus Reaction Questionnaire. Data were collected before cochlear implantation and 3, 6, 12, and 24 months after implantation. Twenty-eight postlingual unilaterally deaf adults with or without tinnitus were implanted. There was a significant improvement in speech perception in noise across time in all spatial configurations. There was an overall significant improvement on the subjective perception of hearing and quality of life. Tinnitus disturbance reduced significantly across time. Age at implantation and duration of deafness did not influence the outcomes significantly. Cochlear implantation provided significant improvement in speech understanding in challenging situations, subjective perception of hearing performance, and quality of life. Cochlear implantation also resulted in reduced tinnitus disturbance. Age at implantation and duration of deafness did not seem to influence the outcomes.
Gebru, Israel D; Ba, Sileye; Li, Xiaofei; Horaud, Radu
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants rather than facing the cameras and the microphones. Multiple-person visual tracking is combined with multiple speech-source localization in order to tackle the speech-to-person association problem. The latter is solved within a novel audio-visual fusion method on the following grounds: binaural spectral features are first extracted from a microphone pair, then a supervised audio-visual alignment technique maps these features onto an image, and finally a semi-supervised clustering method assigns binaural spectral features to visible persons. The main advantage of this method over previous work is that it processes in a principled way speech signals uttered simultaneously by multiple persons. The diarization itself is cast into a latent-variable temporal graphical model that infers speaker identities and speech turns, based on the output of an audio-visual association process, executed at each time slice, and on the dynamics of the diarization variable itself. The proposed formulation yields an efficient exact inference procedure. A novel dataset, that contains audio-visual training data as well as a number of scenarios involving several participants engaged in formal and informal dialogue, is introduced. The proposed method is thoroughly tested and benchmarked with respect to several state-of-the art diarization algorithms.
Several articles addressing topics in speech research are presented. The topics include: exploring the functional significance of physiological tremor: A biospectroscopic approach; differences between experienced and inexperienced listeners to deaf speech; a language-oriented view of reading and its disabilities; Phonetic factors in letter detection; categorical perception; Short-term recall by deaf signers of American sign language; a common basis for auditory sensory storage in perception and immediate memory; phonological awareness and verbal short-term memory; initiation versus execution time during manual and oral counting by stutterers; trading relations in the perception of speech by five-year-old children; the role of the strap muscles in pitch lowering; phonetic validation of distinctive features; consonants and syllable boundaires; and vowel information in postvocalic frictions.
Varnet, Léo; Knoblauch, Kenneth; Serniclaes, Willy; Meunier, Fanny; Hoen, Michel
Although there is a large consensus regarding the involvement of specific acoustic cues in speech perception, the precise mechanisms underlying the transformation from continuous acoustical properties into discrete perceptual units remains undetermined. This gap in knowledge is partially due to the lack of a turnkey solution for isolating critical speech cues from natural stimuli. In this paper, we describe a psychoacoustic imaging method known as the Auditory Classification Image technique that allows experimenters to estimate the relative importance of time-frequency regions in categorizing natural speech utterances in noise. Importantly, this technique enables the testing of hypotheses on the listening strategies of participants at the group level. We exemplify this approach by identifying the acoustic cues involved in da/ga categorization with two phonetic contexts, Al- or Ar-. The application of Auditory Classification Images to our group of 16 participants revealed significant critical regions on the second and third formant onsets, as predicted by the literature, as well as an unexpected temporal cue on the first formant. Finally, through a cluster-based nonparametric test, we demonstrate that this method is sufficiently sensitive to detect fine modifications of the classification strategies between different utterances of the same phoneme.
Full Text Available Although there is a large consensus regarding the involvement of specific acoustic cues in speech perception, the precise mechanisms underlying the transformation from continuous acoustical properties into discrete perceptual units remains undetermined. This gap in knowledge is partially due to the lack of a turnkey solution for isolating critical speech cues from natural stimuli. In this paper, we describe a psychoacoustic imaging method known as the Auditory Classification Image technique that allows experimenters to estimate the relative importance of time-frequency regions in categorizing natural speech utterances in noise. Importantly, this technique enables the testing of hypotheses on the listening strategies of participants at the group level. We exemplify this approach by identifying the acoustic cues involved in da/ga categorization with two phonetic contexts, Al- or Ar-. The application of Auditory Classification Images to our group of 16 participants revealed significant critical regions on the second and third formant onsets, as predicted by the literature, as well as an unexpected temporal cue on the first formant. Finally, through a cluster-based nonparametric test, we demonstrate that this method is sufficiently sensitive to detect fine modifications of the classification strategies between different utterances of the same phoneme.
Slater, Jessica; Skoe, Erika; Strait, Dana L; O'Connell, Samantha; Thompson, Elaine; Kraus, Nina
Music training may strengthen auditory skills that help children not only in musical performance but in everyday communication. Comparisons of musicians and non-musicians across the lifespan have provided some evidence for a "musician advantage" in understanding speech in noise, although reports have been mixed. Controlled longitudinal studies are essential to disentangle effects of training from pre-existing differences, and to determine how much music training is necessary to confer benefits. We followed a cohort of elementary school children for 2 years, assessing their ability to perceive speech in noise before and after musical training. After the initial assessment, participants were randomly assigned to one of two groups: one group began music training right away and completed 2 years of training, while the second group waited a year and then received 1 year of music training. Outcomes provide the first longitudinal evidence that speech-in-noise perception improves after 2 years of group music training. The children were enrolled in an established and successful community-based music program and followed the standard curriculum, therefore these findings provide an important link between laboratory-based research and real-world assessment of the impact of music training on everyday communication skills. Copyright © 2015 Elsevier B.V. All rights reserved.
Moberly, Aaron C; Patel, Tirth R; Castellanos, Irina
As a result of their hearing loss, adults with cochlear implants (CIs) would self-report poorer executive functioning (EF) skills than normal-hearing (NH) peers, and these EF skills would be associated with performance on speech recognition tasks. EF refers to a group of high order neurocognitive skills responsible for behavioral and emotional regulation during goal-directed activity, and EF has been found to be poorer in children with CIs than their NH age-matched peers. Moreover, there is increasing evidence that neurocognitive skills, including some EF skills, contribute to the ability to recognize speech through a CI. Thirty postlingually deafened adults with CIs and 42 age-matched NH adults were enrolled. Participants and their spouses or significant others (informants) completed well-validated self-reports or informant-reports of EF, the Behavior Rating Inventory of Executive Function - Adult (BRIEF-A). CI users' speech recognition skills were assessed in quiet using several measures of sentence recognition. NH peers were tested for recognition of noise-vocoded versions of the same speech stimuli. CI users self-reported difficulty on EF tasks of shifting and task monitoring. In CI users, measures of speech recognition correlated with several self-reported EF skills. The present findings provide further evidence that neurocognitive factors, including specific EF skills, may decline in association with hearing loss, and that some of these EF skills contribute to speech processing under degraded listening conditions.
Full Text Available Considerable progress has been made in the treatment of hearing loss with auditory implants. However, there are still many implanted patients that experience hearing deficiencies, such as limited speech understanding or vanishing perception with continuous stimulation (i.e., abnormal loudness adaptation. The present study aims to identify specific patterns of cerebral cortex activity involved with such deficiencies. We performed O-15-water positron emission tomography (PET in patients implanted with electrodes within the cochlea, brainstem, or midbrain to investigate the pattern of cortical activation in response to speech or continuous multi-tone stimuli directly inputted into the implant processor that then delivered electrical patterns through those electrodes. Statistical parametric mapping was performed on a single subject basis. Better speech understanding was correlated with a larger extent of bilateral auditory cortex activation. In contrast to speech, the continuous multi-tone stimulus elicited mainly unilateral auditory cortical activity in which greater loudness adaptation corresponded to weaker activation and even deactivation. Interestingly, greater loudness adaptation was correlated with stronger activity within the ventral prefrontal cortex, which could be up-regulated to suppress the irrelevant or aberrant signals into the auditory cortex. The ability to detect these specific cortical patterns and differences across patients and stimuli demonstrates the potential for using PET to diagnose auditory function or dysfunction in implant patients, which in turn could guide the development of appropriate stimulation strategies for improving hearing rehabilitation. Beyond hearing restoration, our study also reveals a potential role of the frontal cortex in suppressing irrelevant or aberrant activity within the auditory cortex, and thus may be relevant for understanding and treating tinnitus.
Berding, Georg; Wilke, Florian; Rode, Thilo; Haense, Cathleen; Joseph, Gert; Meyer, Geerd J; Mamach, Martin; Lenarz, Minoo; Geworski, Lilli; Bengel, Frank M; Lenarz, Thomas; Lim, Hubert H
Considerable progress has been made in the treatment of hearing loss with auditory implants. However, there are still many implanted patients that experience hearing deficiencies, such as limited speech understanding or vanishing perception with continuous stimulation (i.e., abnormal loudness adaptation). The present study aims to identify specific patterns of cerebral cortex activity involved with such deficiencies. We performed O-15-water positron emission tomography (PET) in patients implanted with electrodes within the cochlea, brainstem, or midbrain to investigate the pattern of cortical activation in response to speech or continuous multi-tone stimuli directly inputted into the implant processor that then delivered electrical patterns through those electrodes. Statistical parametric mapping was performed on a single subject basis. Better speech understanding was correlated with a larger extent of bilateral auditory cortex activation. In contrast to speech, the continuous multi-tone stimulus elicited mainly unilateral auditory cortical activity in which greater loudness adaptation corresponded to weaker activation and even deactivation. Interestingly, greater loudness adaptation was correlated with stronger activity within the ventral prefrontal cortex, which could be up-regulated to suppress the irrelevant or aberrant signals into the auditory cortex. The ability to detect these specific cortical patterns and differences across patients and stimuli demonstrates the potential for using PET to diagnose auditory function or dysfunction in implant patients, which in turn could guide the development of appropriate stimulation strategies for improving hearing rehabilitation. Beyond hearing restoration, our study also reveals a potential role of the frontal cortex in suppressing irrelevant or aberrant activity within the auditory cortex, and thus may be relevant for understanding and treating tinnitus.
Deschamps, Isabelle; Tremblay, Pascale
The processing of fluent speech involves complex computational steps that begin with the segmentation of the continuous flow of speech sounds into syllables and words. One question that naturally arises pertains to the type of syllabic information that speech processes act upon. Here, we used functional magnetic resonance imaging to profile regions, using a combination of whole-brain and exploratory anatomical region-of-interest (ROI) approaches, that were sensitive to syllabic information during speech perception by parametrically manipulating syllabic complexity along two dimensions: (1) individual syllable complexity, and (2) sequence complexity (supra-syllabic). We manipulated the complexity of the syllable by using the simplest syllable template-a consonant and vowel (CV)-and inserting an additional consonant to create a complex onset (CCV). The supra-syllabic complexity was manipulated by creating sequences composed of the same syllable repeated six times (e.g., /pa-pa-pa-pa-pa-pa/) and sequences of three different syllables each repeated twice (e.g., /pa-ta-ka-pa-ta-ka/). This parametrical design allowed us to identify brain regions sensitive to (1) syllabic complexity independent of supra-syllabic complexity, (2) supra-syllabic complexity independent of syllabic complexity and, (3) both syllabic and supra-syllabic complexity. High-resolution scans were acquired for 15 healthy adults. An exploratory anatomical ROI analysis of the supratemporal plane (STP) identified bilateral regions within the anterior two-third of the planum temporale, the primary auditory cortices as well as the anterior two-third of the superior temporal gyrus that showed different patterns of sensitivity to syllabic and supra-syllabic information. These findings demonstrate that during passive listening of syllable sequences, sublexical information is processed automatically, and sensitivity to syllabic and supra-syllabic information is localized almost exclusively within the STP.
Møller, Anders Kalsgaard; Hoffmann, Pablo F.; Carrozzino, Marcello
The state-of-the-art speech intelligibility tests are created with the purpose of evaluating acoustic communication devices and not for evaluating audio-visual virtual reality systems. This paper present a novel method to evaluate a communication situation based on both the speech intelligibility...
Schmid, Gabriele; Thielmann, Anke; Ziegler, Wolfram
Patients with lesions of the left hemisphere often suffer from oral-facial apraxia, apraxia of speech, and aphasia. In these patients, visual features often play a critical role in speech and language therapy, when pictured lip shapes or the therapist's visible mouth movements are used to facilitate speech production and articulation. This demands…
Zekveld, Adriana A; Kramer, Sophia E; Festen, Joost M
correlation coefficients indicated that the cognitive load was larger in listeners with better TRT performances as reflected by a longer peak latency (normal-hearing participants, SRT50% condition) and a larger peak amplitude and longer response duration (hearing-impaired participants, SRT50% and SRT84% conditions). Also, a larger word vocabulary was related to longer response duration in the SRT84% condition for the participants with normal hearing. The pupil response systematically increased with decreasing speech intelligibility. Ageing and hearing loss were related to less release from effort when increasing the intelligibility of speech in noise. In difficult listening conditions, these factors may induce cognitive overload relatively early or they may be associated with relatively shallow speech processing. More research is needed to elucidate the underlying mechanisms explaining these results. Better TRTs and larger word vocabulary were related to higher mental processing load across speech intelligibility levels. This indicates that utilizing linguistic ability to improve speech perception is associated with increased listening load.
Wiinberg, Alan; Jepsen, Morten Løve; Epp, Bastian
Objective: The purpose was to investigate the effects of hearing-loss and fast-acting compression on speech intelligibility and two measures of temporal modulation sensitivity. Design: Twelve adults with normal hearing (NH) and 16 adults with mild to moderately severe sensorineural hearing loss......, the MDD thresholds were higher for the group with hearing loss than for the group with NH. Fast-acting compression increased the modulation detection thresholds, while no effect of compression on the MDD thresholds was observed. The speech reception thresholds obtained in stationary noise were slightly...... of the modulation detection thresholds, compression does not seem to provide a benefit for speech intelligibility. Furthermore, fast-acting compression may not be able to restore MDD thresholds to the values observed for listeners with NH, suggesting that the two measures of amplitude modulation sensitivity...
Strelcyk, Olaf; Dau, Torsten
Hearing-impaired people often experience great difficulty with speech communication when background noise is present, even if reduced audibility has been compensated for. Other impairment factors must be involved. In order to minimize confounding effects, the subjects participating in this study...... consisted of groups with homogeneous, symmetric audiograms. The perceptual listening experiments assessed the intelligibility of full-spectrum as well as low-pass filtered speech in the presence of stationary and fluctuating interferers, the individual's frequency selectivity and the integrity of temporal...... modulation were obtained. In addition, these binaural and monaural thresholds were measured in a stationary background noise in order to assess the persistence of the fine-structure processing to interfering noise. Apart from elevated speech reception thresholds, the hearing impaired listeners showed poorer...
Laccourreye, L; Ettienne, V; Prang, I; Couloigner, V; Garabedian, E-N; Loundon, N
To analyze speech in children with profound hearing loss following congenital cytomegalovirus (cCMV) infection with cochlear implantation (CI) before the age of 3 years. In a cohort of 15 children with profound hearing loss, speech perception, production and intelligibility were assessed before and 3 years after CI; variables impacting results were explored. Post-CI, median word recognition was 74% on closed-list and 48% on open-list testing; 80% of children acquired speech production; and 60% were intelligible for all listeners or listeners attentive to lip-reading and/or aware of the child's hearing loss. Univariate analysis identified 3 variables (mean post-CI hearing threshold, bilateral vestibular areflexia, and brain abnormality on MRI) with significant negative impact on the development of speech perception, production and intelligibility. CI showed positive impact on hearing and speech in children with post-cCMV profound hearing loss. Our study demonstrated the key role of maximizing post-CI hearing gain. A few children had insufficient progress, especially in case of bilateral vestibular areflexia and/or brain abnormality on MRI. This led us to suggest that balance rehabilitation and speech therapy should be intensified in such cases. Copyright © 2015 Elsevier Masson SAS. All rights reserved.
Vickers, Deborah A; Backus, Bradford C; Macdonald, Nora K; Rostamzadeh, Niloofar K; Mason, Nisha K; Pandya, Roshni; Marriage, Josephine E; Mahon, Merle H
The assessment of the combined effect of classroom acoustics and sound field amplification (SFA) on children's speech perception within the "live" classroom poses a challenge to researchers. The goals of this study were to determine: (1) Whether personal response system (PRS) hand-held voting cards, together with a closed-set speech perception test (Chear Auditory Perception Test [CAPT]), provide an appropriate method for evaluating speech perception in the classroom; (2) Whether SFA provides better access to the teacher's speech than without SFA for children, taking into account vocabulary age, middle ear dysfunction or ear-canal wax, and home language. Forty-four children from two school-year groups, year 2 (aged 6 years 11 months to 7 years 10 months) and year 3 (aged 7 years 11 months to 8 years 10 months) were tested in two classrooms, using a shortened version of the four-alternative consonant discrimination section of the CAPT. All children used a PRS to register their chosen response, which they selected from four options displayed on the interactive whiteboard. The classrooms were located in a 19th-century school in central London, United Kingdom. Each child sat at their usual position in the room while target speech stimuli were presented either in quiet or in noise. The target speech was presented from the front of the classroom at 65 dBA (calibrated at 1 m) and the presented noise level was 46 dBA measured at the center of the classroom. The older children had an additional noise condition with a noise level of 52 dBA. All conditions were presented twice, once with SFA and once without SFA and the order of testing was randomized. White noise from the teacher's right-hand side of the classroom and International Speech Test Signal from the teacher's left-hand side were used, and the noises were matched at the center point of the classroom (10sec averaging [A-weighted]). Each child's expressive vocabulary age and middle ear status were measured
Full Text Available Natural sounds, including vocal communication sounds, contain critical information at multiple time scales. Two essential temporal modulation rates in speech have been argued to be in the low gamma band (~20-80 ms duration information and the theta band (~150-300 ms, corresponding to segmental and syllabic modulation rates, respectively. On one hypothesis, auditory cortex implements temporal integration using time constants closely related to these values. The neural correlates of a proposed dual temporal window mechanism in human auditory cortex remain poorly understood. We recorded MEG responses from participants listening to non-speech auditory stimuli with different temporal structures, created by concatenating frequency-modulated segments of varied segment durations. We show that these non-speech stimuli with temporal structure matching speech-relevant scales (~25 ms and ~200 ms elicit reliable phase tracking in the corresponding associated oscillatory frequencies (low gamma and theta bands. In contrast, stimuli with non-matching temporal structure do not. Furthermore, the topography of theta band phase tracking shows rightward lateralization while gamma band phase tracking occurs bilaterally. The results support the hypothesis that there exists multi-time resolution processing in cortex on discontinuous scales and provide evidence for an asymmetric organization of temporal analysis (asymmetrical sampling in time, AST. The data argue for a macroscopic-level neural mechanism underlying multi-time resolution processing: the sliding and resetting of intrinsic temporal windows on privileged time scales.
Cabrera, Laurianne; Bertoncini, Josiane; Lorenzi, Christian
Purpose: The capacity of 6-month-old infants to discriminate a voicing contrast (/aba/--/apa/) on the basis of "amplitude modulation (AM) cues" and "frequency modulation (FM) cues" was evaluated. Method: Several vocoded speech conditions were designed to either degrade FM cues in 4 or 32 bands or degrade AM in 32 bands. Infants…
Dankovicova, Jana; Hunt, Claire
Foreign accent syndrome (FAS) is an acquired neurogenic disorder characterized by altered speech that sounds foreign-accented. This study presents a British subject perceived to speak with an Italian (or Greek) accent after a brainstem (pontine) stroke. Native English listeners rated the strength of foreign accent and impairment they perceived in…
In difficult listening situations, e.g. in a cocktail party scenario, speech signal that a listener is interested in decoding is often masked by unwanted noise, disrupting bottom-up signalling. A normal hearing (NH) listener is able to withstand a reasonable amount of such disruption and can still
Bryan, Karen; Gregory, Juliette
The purpose of this research was to ascertain the views of staff and managers within a youth offending team on their experiences of working with a speech and language therapist (SLT). The model of therapy provision was similar to the whole-systems approach used in schools. The impact of the service on language outcomes is reported elsewhere…
Zekveld, A.A.; Kramer, S.E.; Kessens, J.M.; Vlaming, M.S.M.G.; Houtgast, T.
OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output from an automatic speech recognition (ASR) system during listening to speech in noise. DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT
Murakami, Takenobu; Restle, Julia; Ziemann, Ulf
A left-hemispheric cortico-cortical network involving areas of the temporoparietal junction (Tpj) and the posterior inferior frontal gyrus (pIFG) is thought to support sensorimotor integration of speech perception into articulatory motor activation, but how this network links with the lip area of the primary motor cortex (M1) during speech…
van der Jagt, M Annerie; Briaire, Jeroen J; Verbist, Berit M; Frijns, Johan H M
The HiFocus Mid-Scala (MS) electrode array has recently been introduced onto the market. This precurved design with a targeted mid-scalar intracochlear position pursues an atraumatic insertion and optimal distance for neural stimulation. In this study we prospectively examined the angular insertion depth achieved and speech perception outcomes resulting from the HiFocus MS electrode array for 6 months after implantation, and retrospectively compared these with the HiFocus 1J lateral wall electrode array. The mean angular insertion depth within the MS population (n = 96) was found at 470°. This was 50° shallower but more consistent than the 1J electrode array (n = 110). Audiological evaluation within a subgroup, including only postlingual, unilaterally implanted, adult cochlear implant recipients who were matched on preoperative speech perception scores and the duration of deafness (MS = 32, 1J = 32), showed no difference in speech perception outcomes between the MS and 1J groups. Furthermore, speech perception outcome was not affected by the angular insertion depth or frequency mismatch. © 2016 S. Karger AG, Basel.
Al-Dawaideh, Ahmad Mousa
Speech-language pathologists (SLPs) frequently work with people with severe communication disorders who require assistive technology (AT) for communication. The purpose of this study was to investigate the SLPs perceptions of the importance of and ability level required for using AT, and the relationship of AT with gender, level of education,…
Loucas, Tom; Riches, Nick Greatorex; Charman, Tony; Pickles, Andrew; Simonoff, Emily; Chandler, Susie; Baird, Gillian
Background: The cognitive bases of language impairment in specific language impairment (SLI) and autism spectrum disorders (ASD) were investigated in a novel non-word comparison task which manipulated phonological short-term memory (PSTM) and speech perception, both implicated in poor non-word repetition. Aims: This study aimed to investigate the…
Full Text Available This article discusses the connection of hemispheric control over audioverbal perception processes and such individual features as “leading hand” (right-handedness and lefthandedness. We present a literature review and description of our research to provide evidence of the complexity and ambiguity of this connection. The method of dichotic listening was used for diagnosing audioverbal perception lateralization. This method allows estimation of the right-ear coefficient (REC, the efficiency coefficient (EC, and the effectiveness ratio (ER of different aspects of audioverbal perception. Our research involved 47 persons with a leading right hand (mean age, 29.04±9.97 years and 32 persons with a leading left hand (mean age, 29.41±10.34 years. Different hypotheses about the mechanisms of hemispheric control over audioverbal and motor processes were assessed. The research showed that both the leftand right-handers’ audioverbal perception characteristics depended mainly on right-hemisphere activity. The most dynamic and sensitive index of the functioning of the two hemispheres during dichotic listening was the efficiency coefficient of stimuli reproduction through the left ear (EC of the left ear. It turns out that this index depends on the coincidence/noncoincidence of the leading hemispheres in speech and motor processes. The highest efficiency of audioverbal perception revealed itself in the left-handers with a leading left ear (the hemispheric-control coincidence, and the lowest efficiency was in the left-handers with a leading right ear (the hemispheric-control divergence. The right-handers were characterized by less variation in values, although the influence of the coincidence/noncoincidence of the leading hemispheres in speech and motor processes also revealed itself as a tendency. This consistent pattern points out the necessity for further research on asymmetries of the different modalities that takes into account their probable
Tse, Chun-Yu; Gratton, Gabriele; Garnsey, Susan M; Novak, Michael A; Fabiani, Monica
Information from different modalities is initially processed in different brain areas, yet real-world perception often requires the integration of multisensory signals into a single percept. An example is the McGurk effect, in which people viewing a speaker whose lip movements do not match the utterance perceive the spoken sounds incorrectly, hearing them as more similar to those signaled by the visual rather than the auditory input. This indicates that audiovisual integration is important for generating the phoneme percept. Here we asked when and where the audiovisual integration process occurs, providing spatial and temporal boundaries for the processes generating phoneme perception. Specifically, we wanted to separate audiovisual integration from other processes, such as simple deviance detection. Building on previous work employing ERPs, we used an oddball paradigm in which task-irrelevant audiovisually deviant stimuli were embedded in strings of non-deviant stimuli. We also recorded the event-related optical signal, an imaging method combining spatial and temporal resolution, to investigate the time course and neuroanatomical substrate of audiovisual integration. We found that audiovisual deviants elicit a short duration response in the middle/superior temporal gyrus, whereas audiovisual integration elicits a more extended response involving also inferior frontal and occipital regions. Interactions between audiovisual integration and deviance detection processes were observed in the posterior/superior temporal gyrus. These data suggest that dynamic interactions between inferior frontal cortex and sensory regions play a significant role in multimodal integration.
Zhang, Chenyi; Akagi, Masato
Without understanding of one language, emotional contents of a voice can still be judged by human beings. However, it is reported that differences occur in emotion perception among listeners with different mother languages. Investigating reasons that differences occur may provide a systematic method to the discussions of emotion perception in a cross-language scenario. Therefore, this study discusses commonalities and differences between emotion perception of Japanese and Chinese listeners fo...
Methods: New graduates of six South African universities were recruited to participate in a survey by completing an electronic questionnaire exploring their perceptions of the dysphagia curricula and their preparedness to practise across the scope of the profession of speechlanguage therapy. Results: Eighty graduates participated in the study yielding a response rate of 63.49%. Participants perceived themselves to be well prepared in some areas (e.g. child language: 100%; articulation and phonology: 97.26%, but less prepared in other areas (e.g. adult dysphagia: 50.70%; paediatric dysarthria: 46.58%; paediatric dysphagia: 38.36% and most unprepared to provide services requiring sign language (23.61% and African languages (20.55%. There was a significant relationship between perceptions of adequate theory and clinical learning opportunities with assessment and management of dysphagia and perceptions of preparedness to provide dysphagia services. Conclusion: There is a need for review of existing curricula and consideration of developing a standard speech-language therapy curriculum across universities, particularly in service provision to a multilingual population, and in both the theory and clinical learning of the assessment and management of adult and paediatric dysphagia, to better equip graduates for practice.
Joey L. Weidema
Full Text Available Whether pitch in language and music is governed by domain-specific or domain-general cognitive mechanisms is contentiously debated. The aim of the present study was to investigate whether mechanisms governing pitch contour perception operate differently when pitch information is interpreted as either speech or music. By modulating listening mode, this study aspired to demonstrate that pitch contour perception relies on domain-specific cognitive mechanisms, which are regulated by top-down influences from language and music. Three groups of participants (Mandarin speakers, Dutch speaking non-musicians, and Dutch musicians were exposed to identical pitch contours, and tested on their ability to identify these contours in a language and musical context. Stimuli consisted of disyllabic words spoken in Mandarin, and melodic tonal analogues, embedded in a linguistic and melodic carrier phrase, respectively. Participants classified identical pitch contours as significantly different depending on listening mode. Top-down influences from language appeared to alter the perception of pitch contour in speakers of Mandarin. This was not the case for non-musician speakers of Dutch. Moreover, this effect was lacking in Dutch speaking musicians. The classification patterns of pitch contours in language and music seem to suggest that domain-specific categorization is modulated by top-down influences from language and music.
Shetty, Hemanth Narayan; Koonoor, Vishal
Past research has reported that children with repeated occurrences of otitis media at an early age have a negative impact on speech perception at a later age. The present study necessitates documenting the temporal and spectral processing on speech perception in noise from normal and atypical groups. The present study evaluated the relation between speech perception in noise and temporal; and spectral processing abilities in children with normal and atypical groups. The study included two experiments. In the first experiment, temporal resolution and frequency discrimination of listeners with normal group and three subgroups of atypical groups (had a history of OM) a) less than four episodes b) four to nine episodes and c) More than nine episodes during their chronological age of 6 months to 2 years) were evaluated using measures of temporal modulation transfer function and frequency discrimination test. In the second experiment, SNR 50 was evaluated on each group of study participants. All participants had normal hearing and middle ear status during the course of testing. Demonstrated that children with atypical group had significantly poorer modulation detection threshold, peak sensitivity and bandwidth; and frequency discrimination to each F0 than normal hearing listeners. Furthermore, there was a significant correlation seen between measures of temporal resolution; frequency discrimination and speech perception in noise. It infers atypical groups have significant impairment in extracting envelope as well as fine structure cues from the signal. The results supported the idea that episodes of OM before 2 years of agecan produce periods of sensory deprivation that alters the temporal and spectral skills which in turn has negative consequences on speech perception in noise. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Kummerer, Sharon E; Lopez-Reyna, Norma A; Hughes, Marie Tejero
This qualitative study explored mothers' perceptions of their children's communication disabilities, emergent literacy development, and speech-language therapy programs. Participants were 14 Mexican immigrant mothers and their children (age 17-47 months) who were receiving center-based services from an early childhood intervention program, located in a large urban city in the Midwestern United States. Mother interviews composed the primary source of data. A secondary source of data included children's therapy files and log notes. Following the analysis of interviews through the constant comparative method, grounded theory was generated. The majority of mothers perceived their children as exhibiting a communication delay. Causal attributions were diverse and generally medical in nature (i.e., ear infections, seizures) or due to familial factors (i.e., family history and heredity, lack of extended family). Overall, mothers seemed more focused on their children's speech intelligibility and/or expressive language in comparison to emergent literacy abilities. To promote culturally responsive intervention, mothers recommended that professionals speak Spanish, provide information about the therapy process, and use existing techniques with Mexican immigrant families.
Parker, Norton S.
In audiovisual writing the writer must first learn to think in terms of moving visual presentation. The writer must research his script, organize it, and adapt it to a limited running time. By use of a pleasant-sounding narrator and well-written narration, the visual and narrative can be successfully integrated. There are two types of script…
Schlueter, Anne; Brand, Thomas; Lemke, Ulrike; Nitzschner, Stefan; Kollmeier, Birger; Holube, Inga
Positive signal-to-noise ratios (SNRs) characterize listening situations most relevant for hearing-impaired listeners in daily life and should therefore be considered when evaluating hearing aid algorithms. For this, a speech-in-noise test was developed and evaluated, in which the background noise is presented at fixed positive SNRs and the speech rate (i.e., the time compression of the speech material) is adaptively adjusted. In total, 29 younger and 12 older normal-hearing, as well as 24 older hearing-impaired listeners took part in repeated measurements. Younger normal-hearing and older hearing-impaired listeners conducted one of two adaptive methods which differed in adaptive procedure and step size. Analysis of the measurements with regard to list length and estimation strategy for thresholds resulted in a practical method measuring the time compression for 50% recognition. This method uses time-compression adjustment and step sizes according to Versfeld and Dreschler [(2002). J. Acoust. Soc. Am. 111, 401-408], with sentence scoring, lists of 30 sentences, and a maximum likelihood method for threshold estimation. Evaluation of the procedure showed that older participants obtained higher test-retest reliability compared to younger participants. Depending on the group of listeners, one or two lists are required for training prior to data collection.
Bárbara Cristiane Sordi Silva
Full Text Available Abstract Introduction: Diabetes mellitus (DM is a chronic metabolic disorder of various origins that occurs when the pancreas fails to produce insulin in sufficient quantities or when the organism fails to respond to this hormone in an efficient manner. Objective: To evaluate the speech recognition in subjects with type I diabetes mellitus (DMI in quiet and in competitive noise. Methods: It was a descriptive, observational and cross-section study. We included 40 participants of both genders aged 18-30 years, divided into a control group (CG of 20 healthy subjects with no complaints or auditory changes, paired for age and gender with the study group, consisting of 20 subjects with a diagnosis of DMI. First, we applied basic audiological evaluations (pure tone audiometry, speech audiometry and immittance audiometry for all subjects; after these evaluations, we applied Sentence Recognition Threshold in Quiet (SRTQ and Sentence Recognition Threshold in Noise (SRTN in free field, using the List of Sentences in Portuguese test. Results: All subjects showed normal bilateral pure tone threshold, compatible speech audiometry and "A" tympanometry curve. Group comparison revealed a statistically significant difference for SRTQ (p = 0.0001, SRTN (p < 0.0001 and the signal-to-noise ratio (p < 0.0001. Conclusion: The performance of DMI subjects in SRTQ and SRTN was worse compared to the subjects without diabetes.