Yeung, H Henny; Werker, Janet F
Speech is robustly audiovisual from early in infancy. Here we show that audiovisual speech perception in 4.5-month-old infants is influenced by sensorimotor information related to the lip movements they make while chewing or sucking. Experiment 1 consisted of a classic audiovisual matching procedure, in which two simultaneously displayed talking faces (visual [i] and [u]) were presented with a synchronous vowel sound (audio /i/ or /u/). Infants' looking patterns were selectively biased away from the audiovisual matching face when the infants were producing lip movements similar to those needed to produce the heard vowel. Infants' looking patterns returned to those of a baseline condition (no lip movements, looking longer at the audiovisual matching face) when they were producing lip movements that did not match the heard vowel. Experiment 2 confirmed that these sensorimotor effects interacted with the heard vowel, as looking patterns differed when infants produced these same lip movements while seeing and hearing a talking face producing an unrelated vowel (audio /a/). These findings suggest that the development of speech perception and speech production may be mutually informative.
Audiovisual information is integrated in speech perception. One manifestation of this is the McGurk illusion in which watching the articulating face alters the auditory phonetic percept. Understanding this phenomenon fully requires a computational model with predictive power. Here, we describe...... ordinal models that can account for the McGurk illusion. We compare this type of models to the Fuzzy Logical Model of Perception (FLMP) in which the response categories are not ordered. While the FLMP generally fit the data better than the ordinal model it also employs more free parameters in complex...... experiments when the number of response categories are high as it is for speech perception in general. Testing the predictive power of the models using a form of cross-validation we found that ordinal models perform better than the FLMP. Based on these findings we suggest that ordinal models generally have...
Vroomen, Jean; Stekelenburg, Jeroen J.
Perception of intersensory temporal order is particularly difficult for (continuous) audiovisual speech, as perceivers may find it difficult to notice substantial timing differences between speech sounds and lip movements. Here we tested whether this occurs because audiovisual speech is strongly paired ("unity assumption"). Participants made…
Eskelund, Kasper; Dau, Torsten
Speech perception integrates signal from ear and eye. This is witnessed by a wide range of audiovisual integration effects, such as ventriloquism and the McGurk illusion. Some behavioral evidence suggest that audiovisual integration of specific aspects is special for speech perception. However, o...
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
investigate whether the integration of auditory and visual speech observed in these two audiovisual integration effects are specific traits of speech perception. We further ask whether audiovisual integration is undertaken in a single processing stage or multiple processing stages.......Integration of speech signals from ear and eye is a well-known feature of speech perception. This is evidenced by the McGurk illusion in which visual speech alters auditory speech perception and by the advantage observed in auditory speech detection when a visual signal is present. Here we...
Andersen, Tobias; Tiippana, K.; Laarni, J.
integration did not change. Visual spatial attention was also able to select between the faces when lip reading. This suggests that visual spatial attention acts at the level of visual speech perception prior to audiovisual integration and that the effect propagates through audiovisual integration......Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive...... but recent reports have challenged this view. Here we study the effect of visual spatial attention on the McGurk effect. By presenting a movie of two faces symmetrically displaced to each side of a central fixation point and dubbed with a single auditory speech track, we were able to discern the influences...
Eskelund, Kasper; Andersen, Tobias
Speech perception is audiovisual as evidenced by bimodal integration in the McGurk effect. This integration effect may be specific to speech or be applied to all stimuli in general. To investigate this, Tuomainen et al. (2005) used sine-wave speech, which naÃ¯ve observers may perceive as non-speech......, but hear as speech once informed of the linguistic origin of the signal. Combinations of sine-wave speech and incongruent video of the talker elicited a McGurk effect only for informed observers. This indicates that the audiovisual integration effect is specific to speech perception. However, observers...... of the speaker. Observers were required to report this after primary target categorization. We found a significant McGurk effect only in the natural speech and speech mode conditions supporting the finding of Tuomainen et al. Performance in the secondary task was similar in all conditions indicating...
Asakawa, Kaori; Tanaka, Akihiro; Imai, Hisato
We investigated whether audiovisual synchrony perception for speech could change after observation of the audiovisual temporal mismatch. Previous studies have revealed that audiovisual synchrony perception is re-calibrated after exposure to a constant timing difference between auditory and visual signals in non-speech. In the present study, we examined whether this audiovisual temporal recalibration occurs at the perceptual level even for speech (monosyllables). In Experiment 1, participants performed an audiovisual simultaneity judgment task (i.e., a direct measurement of the audiovisual synchrony perception) in terms of the speech signal after observation of the speech stimuli which had a constant audiovisual lag. The results showed that the “simultaneous” responses (i.e., proportion of responses for which participants judged the auditory and visual stimuli to be synchronous) at least partly depended on exposure lag. In Experiment 2, we adopted the McGurk identification task (i.e., an indirect measurement of the audiovisual synchrony perception) to exclude the possibility that this modulation of synchrony perception was solely attributable to the response strategy using stimuli identical to those of Experiment 1. The characteristics of the McGurk effect reported by participants depended on exposure lag. Thus, it was shown that audiovisual synchrony perception for speech could be modulated following exposure to constant lag both in direct and indirect measurement. Our results suggest that temporal recalibration occurs not only in non-speech signals but also in monosyllabic speech at the perceptual level.
Jaekl, Philip; Pesquita, Ana; Alsius, Agnes; Munhall, Kevin; Soto-Faraco, Salvador
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues; two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas, some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this aim, we measured word identification performance in noise using unimodal auditory stimuli, and with audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point light displays achieved via motion capture of the original talker. Point light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly towards the perception of articulatory gestures and the analysis of audiovisual speech.
Stevenson, Ryan A.; Siemann, Justin K.; Woynaroski, Tiffany G.; Schneider, Brittany C.; Eberly, Haley E.; Camarata, Stephen M.; Wallace, Mark T.
Atypical communicative abilities are a core marker of Autism Spectrum Disorders (ASD). A number of studies have shown that, in addition to auditory comprehension differences, individuals with autism frequently show atypical responses to audiovisual speech, suggesting a multisensory contribution to these communicative differences from their typically developing peers. To shed light on possible differences in the maturation of audiovisual speech integration, we tested younger (ages 6-12) and older (ages 13-18) children with and without ASD on a task indexing such multisensory integration. To do this, we used the McGurk effect, in which the pairing of incongruent auditory and visual speech tokens typically results in the perception of a fused percept distinct from the auditory and visual signals, indicative of active integration of the two channels conveying speech information. Whereas little difference was seen in audiovisual speech processing (i.e., reports of McGurk fusion) between the younger ASD and TD groups, there was a significant difference at the older ages. While TD controls exhibited an increased rate of fusion (i.e., integration) with age, children with ASD failed to show this increase. These data suggest arrested development of audiovisual speech integration in ASD. The results are discussed in light of the extant literature and necessary next steps in research. PMID:24218241
Getz, Laura M.; Nordeen, Elke R.; Vrabic, Sarah C.; Toscano, Joseph C.
Adult speech perception is generally enhanced when information is provided from multiple modalities. In contrast, infants do not appear to benefit from combining auditory and visual speech information early in development. This is true despite the fact that both modalities are important to speech comprehension even at early stages of language acquisition. How then do listeners learn how to process auditory and visual information as part of a unified signal? In the auditory domain, statistical learning processes provide an excellent mechanism for acquiring phonological categories. Is this also true for the more complex problem of acquiring audiovisual correspondences, which require the learner to integrate information from multiple modalities? In this paper, we present simulations using Gaussian mixture models (GMMs) that learn cue weights and combine cues on the basis of their distributional statistics. First, we simulate the developmental process of acquiring phonological categories from auditory and visual cues, asking whether simple statistical learning approaches are sufficient for learning multi-modal representations. Second, we use this time course information to explain audiovisual speech perception in adult perceivers, including cases where auditory and visual input are mismatched. Overall, we find that domain-general statistical learning techniques allow us to model the developmental trajectory of audiovisual cue integration in speech, and in turn, allow us to better understand the mechanisms that give rise to unified percepts based on multiple cues. PMID:28335558
Saalasti, Satu; Katsyri, Jari; Tiippana, Kaisa; Laine-Hernandez, Mari; von Wendt, Lennart; Sams, Mikko
Audiovisual speech perception was studied in adults with Asperger syndrome (AS), by utilizing the McGurk effect, in which conflicting visual articulation alters the perception of heard speech. The AS group perceived the audiovisual stimuli differently from age, sex and IQ matched controls. When a voice saying /p/ was presented with a face…
Meronen, Auli; Tiippana, Kaisa; Westerholm, Jari; Ahonen, Timo
Purpose: The effect of the signal-to-noise ratio (SNR) on the perception of audiovisual speech in children with and without developmental language disorder (DLD) was investigated by varying the noise level and the sound intensity of acoustic speech. The main hypotheses were that the McGurk effect (in which incongruent visual speech alters the…
Alsius, Agnès; Möttönen, Riikka; Sams, Mikko E; Soto-Faraco, Salvador; Tiippana, Kaisa
Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs) generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual, and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e., a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Full Text Available Seeing articulatory movements influences perception of auditory speech. This is often reflected in a shortened latency of auditory event-related potentials (ERPs generated in the auditory cortex. The present study addressed whether this early neural correlate of audiovisual interaction is modulated by attention. We recorded ERPs in 15 subjects while they were presented with auditory, visual and audiovisual spoken syllables. Audiovisual stimuli consisted of incongruent auditory and visual components known to elicit a McGurk effect, i.e. a visually driven alteration in the auditory speech percept. In a Dual task condition, participants were asked to identify spoken syllables whilst monitoring a rapid visual stream of pictures for targets, i.e., they had to divide their attention. In a Single task condition, participants identified the syllables without any other tasks, i.e., they were asked to ignore the pictures and focus their attention fully on the spoken syllables. The McGurk effect was weaker in the Dual task than in the Single task condition, indicating an effect of attentional load on audiovisual speech perception. Early auditory ERP components, N1 and P2, peaked earlier to audiovisual stimuli than to auditory stimuli when attention was fully focused on syllables, indicating neurophysiological audiovisual interaction. This latency decrement was reduced when attention was loaded, suggesting that attention influences early neural processing of audiovisual speech. We conclude that reduced attention weakens the interaction between vision and audition in speech.
Evitts, Paul M; Van Dine, Ami; Holler, Aline
There is minimal research on listener perceptions of an individual with a laryngectomy (IWL) based on audio-visual information. The aim of this research was to provide preliminary insight into whether listeners have different perceptions of an individual with a laryngectomy based on mode of presentation (audio-only vs. audio-visual) and mode of speech (tracheoesophageal, oesophageal, electrolaryngeal, normal). Thirty-four naïve listeners were randomly presented with a standard reading passage produced by one typical speaker from each mode of speech in both audio-only and audio-visual presentation mode. Listeners used a visual analogue scale (10 cm line) to indicate their perceptions of each speaker's personality. A significant effect for mode of speech was present. There was no significant difference in listener perceptions between mode of presentation using individual ratings. However, principal component analysis showed ratings were more favourable in the audio-visual mode. Results of this study suggest that visual information may only have a minor impact on listener perceptions of a speakers' personality and that mode of speech and degree of speech proficiency may only play a small role in listener perceptions. However, results should be interpreted with caution as results are based on only one speaker per mode of speech.
Knowland, Victoria C. P.; Mercure, Evelyne; Karmiloff-Smith, Annette; Dick, Fred; Thomas, Michael S. C.
Being able to see a talking face confers a considerable advantage for speech perception in adulthood. However, behavioural data currently suggest that children fail to make full use of these available visual speech cues until age 8 or 9. This is particularly surprising given the potential utility of multiple informational cues during language…
Hisanaga, Satoko; Sekiyama, Kaoru; Igasaki, Tomohiko; Murayama, Nobuki
Several behavioural studies have shown that the interplay between voice and face information in audiovisual speech perception is not universal. Native English speakers (ESs) are influenced by visual mouth movement to a greater degree than native Japanese speakers (JSs) when listening to speech. However, the biological basis of these group differences is unknown. Here, we demonstrate the time-varying processes of group differences in terms of event-related brain potentials (ERP) and eye gaze for audiovisual and audio-only speech perception. On a behavioural level, while congruent mouth movement shortened the ESs’ response time for speech perception, the opposite effect was observed in JSs. Eye-tracking data revealed a gaze bias to the mouth for the ESs but not the JSs, especially before the audio onset. Additionally, the ERP P2 amplitude indicated that ESs processed multisensory speech more efficiently than auditory-only speech; however, the JSs exhibited the opposite pattern. Taken together, the ESs’ early visual attention to the mouth was likely to promote phonetic anticipation, which was not the case for the JSs. These results clearly indicate the impact of language and/or culture on multisensory speech processing, suggesting that linguistic/cultural experiences lead to the development of unique neural systems for audiovisual speech perception. PMID:27734953
Altvater-Mackensen, Nicole; Mani, Nivedita; Grossmann, Tobias
Recent studies suggest that infants' audiovisual speech perception is influenced by articulatory experience (Mugitani et al., 2008; Yeung & Werker, 2013). The current study extends these findings by testing if infants' emerging ability to produce native sounds in babbling impacts their audiovisual speech perception. We tested 44 6-month-olds…
Full Text Available According to recent ERP (event-related potentials studies, the visual speech facilitates the neural processing of auditory speech for speakers of European languages in audiovisual speech perception. We examined whether this visual facilitation is also the case for Japanese speakers for whom the weaker susceptibility of the visual influence has been behaviorally reported. We conducted a cross-linguistic experiment comparing ERPs of Japanese and English language groups (JL and EL when they were presented with audiovisual congruent as well as audio-only speech stimuli. The temporal facilitation by the additional visual speech was observed only for native speech stimuli, suggesting a role of articulating experiences for early ERP components. For native stimuli, the EL showed sustained visual facilitation for about 300 ms from audio onset. On the other hand, the visual facilitation was limited to the first 100 ms for the JL, and they rather showed a visual inhibitory effect at 300 ms from the audio onset. Thus the type of native language affects neural processing of visual speech in audiovisual speech perception. This inhibition is consistent with behaviorally reported weaker visual influence for the JL.
Venezia, Jonathan H; Thurman, Steven M; Matchin, William; George, Sahara E; Hickok, Gregory
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (~35 % identification of /apa/ compared to ~5 % in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
Irwin, Julia; Preston, Jonathan; Brancazio, Lawrence; D'angelo, Michael; Turcios, Jacqueline
Perception of spoken language requires attention to acoustic as well as visible phonetic information. This article reviews the known differences in audiovisual speech perception in children with autism spectrum disorders (ASD) and specifies the need for interventions that address this construct. Elements of an audiovisual training program are described. This researcher-developed program delivered via an iPad app presents natural speech in the context of increasing noise, but supported with a speaking face. Children are cued to attend to visible articulatory information to assist in perception of the spoken words. Data from four children with ASD ages 8-10 are presented showing that the children improved their performance on an untrained auditory speech-in-noise task.
Alsius, Agnès; Wayne, Rachel V; Paré, Martin; Munhall, Kevin G
The basis for individual differences in the degree to which visual speech input enhances comprehension of acoustically degraded speech is largely unknown. Previous research indicates that fine facial detail is not critical for visual enhancement when auditory information is available; however, these studies did not examine individual differences in ability to make use of fine facial detail in relation to audiovisual speech perception ability. Here, we compare participants based on their ability to benefit from visual speech information in the presence of an auditory signal degraded with noise, modulating the resolution of the visual signal through low-pass spatial frequency filtering and monitoring gaze behavior. Participants who benefited most from the addition of visual information (high visual gain) were more adversely affected by the removal of high spatial frequency information, compared to participants with low visual gain, for materials with both poor and rich contextual cues (i.e., words and sentences, respectively). Differences as a function of gaze behavior between participants with the highest and lowest visual gains were observed only for words, with participants with the highest visual gain fixating longer on the mouth region. Our results indicate that the individual variance in audiovisual speech in noise performance can be accounted for, in part, by better use of fine facial detail information extracted from the visual signal and increased fixation on mouth regions for short stimuli. Thus, for some, audiovisual speech perception may suffer when the visual input (in addition to the auditory signal) is less than perfect.
Kim, Heejung; Hahm, Jarang; Lee, Hyekyoung; Kang, Eunjoo; Kang, Hyejin; Lee, Dong Soo
The human brain naturally integrates audiovisual information to improve speech perception. However, in noisy environments, understanding speech is difficult and may require much effort. Although the brain network is supposed to be engaged in speech perception, it is unclear how speech-related brain regions are connected during natural bimodal audiovisual or unimodal speech perception with counterpart irrelevant noise. To investigate the topological changes of speech-related brain networks at all possible thresholds, we used a persistent homological framework through hierarchical clustering, such as single linkage distance, to analyze the connected component of the functional network during speech perception using functional magnetic resonance imaging. For speech perception, bimodal (audio-visual speech cue) or unimodal speech cues with counterpart irrelevant noise (auditory white-noise or visual gum-chewing) were delivered to 15 subjects. In terms of positive relationship, similar connected components were observed in bimodal and unimodal speech conditions during filtration. However, during speech perception by congruent audiovisual stimuli, the tighter couplings of left anterior temporal gyrus-anterior insula component and right premotor-visual components were observed than auditory or visual speech cue conditions, respectively. Interestingly, visual speech is perceived under white noise by tight negative coupling in the left inferior frontal region-right anterior cingulate, left anterior insula, and bilateral visual regions, including right middle temporal gyrus, right fusiform components. In conclusion, the speech brain network is tightly positively or negatively connected, and can reflect efficient or effortful processes during natural audiovisual integration or lip-reading, respectively, in speech perception.
Full Text Available Gender and age have been found to affect adults’ audio-visual (AV speech perception. However, research on adult aging focuses on adults over 60 years, who have an increasing likelihood for cognitive and sensory decline, which may confound positive effects of age-related AV-experience and its interaction with gender. Observed age and gender differences in AV speech perception may also depend on measurement sensitivity and AV task difficulty. Consequently both AV benefit and visual influence were used to measure visual contribution for gender-balanced groups of young (20-30 years and middle-aged adults (50-60 years with task difficulty varied using AV syllables from different talkers in alternative auditory backgrounds. Females had better speech-reading performance than males. Whereas no gender differences in AV benefit or visual influence were observed for young adults, visually influenced responses were significantly greater for middle-aged females than middle-aged males. That speech-reading performance did not influence AV benefit may be explained by visual speech extraction and AV integration constituting independent abilities. Contrastingly, the gender difference in visually influenced responses in middle adulthood may reflect an experience-related shift in females’ general AV perceptual strategy. Although young females’ speech-reading proficiency may not readily contribute to greater visual influence, between young and middle-adulthood recurrent confirmation of the contribution of visual cues induced by speech-reading proficiency may gradually shift females AV perceptual strategy towards more visually dominated responses.
Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk−MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely...... focused on the fuzzy logical model of perception (FLMP), which provides excellent fits to experimental observations but also has been criticized for being too flexible, post hoc and difficult to interpret. The current study introduces the early maximum likelihood estimation (MLE) model of audiovisual......-validation can evaluate models of audiovisual integration based on typical data sets taking both goodness-of-fit and model flexibility into account. All models were tested on a published data set previously used for testing the FLMP. Cross-validation favored the early MLE while more conventional error measures...
Eskelund, Kasper; Andersen, Tobias
-like nature of the signal. The sine-wave speech was dubbed onto congruent and incongruent video of a talking face. Tuomainen et al. found that the McGurk effect did not occur for naïve observers, but did occur when observers were informed. This indicates that the McGurk illusion is due to a mechanism...... of audiovisual integration specific to speech perception. However, the results of Tuomainen et al. might have been influenced by another effect. When observers were naïve, they had little motivation to look at the face. When informed, they knew that the face was relevant for the task and this could increase...... their motivation for looking at the face. Since Tuomainen et al. did not monitor eye-movements in their experiments the magnitude of the effect of motivation is unknown. The purpose of our first experiment was to replicate Tuomainen et al.’s findings while controlling observers’ eye movements using a secondary...
García-Pérez, Miguel A; Alcalá-Quintana, Rocío
Research on asynchronous audiovisual speech perception manipulates experimental conditions to observe their effects on synchrony judgments. Probabilistic models establish a link between the sensory and decisional processes underlying such judgments and the observed data, via interpretable parameters that allow testing hypotheses and making inferences about how experimental manipulations affect such processes. Two models of this type have recently been proposed, one based on independent channels and the other using a Bayesian approach. Both models are fitted here to a common data set, with a subsequent analysis of the interpretation they provide about how experimental manipulations affected the processes underlying perceived synchrony. The data consist of synchrony judgments as a function of audiovisual offset in a speech stimulus, under four within-subjects manipulations of the quality of the visual component. The Bayesian model could not accommodate asymmetric data, was rejected by goodness-of-fit statistics for 8/16 observers, and was found to be nonidentifiable, which renders uninterpretable parameter estimates. The independent-channels model captured asymmetric data, was rejected for only 1/16 observers, and identified how sensory and decisional processes mediating asynchronous audiovisual speech perception are affected by manipulations that only alter the quality of the visual component of the speech signal.
Lewkowicz, David J; Minar, Nicholas J; Tift, Amy H; Brandon, Melissa
To investigate the developmental emergence of the perception of the multisensory coherence of native and non-native audiovisual fluent speech, we tested 4-, 8- to 10-, and 12- to 14-month-old English-learning infants. Infants first viewed two identical female faces articulating two different monologues in silence and then in the presence of an audible monologue that matched the visible articulations of one of the faces. Neither the 4-month-old nor 8- to 10-month-old infants exhibited audiovisual matching in that they did not look longer at the matching monologue. In contrast, the 12- to 14-month-old infants exhibited matching and, consistent with the emergence of perceptual expertise for the native language, perceived the multisensory coherence of native-language monologues earlier in the test trials than that of non-native language monologues. Moreover, the matching of native audible and visible speech streams observed in the 12- to 14-month-olds did not depend on audiovisual synchrony, whereas the matching of non-native audible and visible speech streams did depend on synchrony. Overall, the current findings indicate that the perception of the multisensory coherence of fluent audiovisual speech emerges late in infancy, that audiovisual synchrony cues are more important in the perception of the multisensory coherence of non-native speech than that of native audiovisual speech, and that the emergence of this skill most likely is affected by perceptual narrowing.
Full Text Available Previous studies have found that perception in older people benefits from multisensory over uni-sensory information. As normal speech recognition is affected by both the auditory input and the visual lip-movements of the speaker, we investigated the efficiency of audio and visual integration in an older population by manipulating the relative reliability of the auditory and visual information in speech. We also investigated the role of the semantic context of the sentence to assess whether audio-visual integration is affected by top-down semantic processing. We presented participants with audio-visual sentences in which the visual component was either blurred or not blurred. We found that there was a greater cost in recall performance for semantically meaningless speech in the audio-visual blur compared to audio-visual no blur condition and this effect was specific to the older group. Our findings have implications for understanding how aging affects efficient multisensory integration for the perception of speech and suggests that multisensory inputs may benefit speech perception in older adults when the semantic content of the speech is unpredictable.
Richoz, Anne-Raphaëlle; Quinn, Paul C.; Hillairet de Boisferon, Anne; Berger, Carole; Loevenbruck, Hélène; Lewkowicz, David J.; Lee, Kang; Dole, Marjorie; Caldara, Roberto; Pascalis, Olivier
Early multisensory perceptual experiences shape the abilities of infants to perform socially-relevant visual categorization, such as the extraction of gender, age, and emotion from faces. Here, we investigated whether multisensory perception of gender is influenced by infant-directed (IDS) or adult-directed (ADS) speech. Six-, 9-, and 12-month-old infants saw side-by-side silent video-clips of talking faces (a male and a female) and heard either a soundtrack of a female or a male voice telling a story in IDS or ADS. Infants participated in only one condition, either IDS or ADS. Consistent with earlier work, infants displayed advantages in matching female relative to male faces and voices. Moreover, the new finding that emerged in the current study was that extraction of gender from face and voice was stronger at 6 months with ADS than with IDS, whereas at 9 and 12 months, matching did not differ for IDS versus ADS. The results indicate that the ability to perceive gender in audiovisual speech is influenced by speech manner. Our data suggest that infants may extract multisensory gender information developmentally earlier when looking at adults engaged in conversation with other adults (i.e., ADS) than when adults are directly talking to them (i.e., IDS). Overall, our findings imply that the circumstances of social interaction may shape early multisensory abilities to perceive gender. PMID:28060872
Yu, Luodi; Rao, Aparna; Zhang, Yang; Burton, Philip C.; Rishiq, Dania; Abrams, Harvey
Although audiovisual (AV) training has been shown to improve overall speech perception in hearing-impaired listeners, there has been a lack of direct brain imaging data to help elucidate the neural networks and neural plasticity associated with hearing aid (HA) use and auditory training targeting speechreading. For this purpose, the current clinical case study reports functional magnetic resonance imaging (fMRI) data from two hearing-impaired patients who were first-time HA users. During the study period, both patients used HAs for 8 weeks; only one received a training program named ReadMyQuipsTM (RMQ) targeting speechreading during the second half of the study period for 4 weeks. Identical fMRI tests were administered at pre-fitting and at the end of the 8 weeks. Regions of interest (ROI) including auditory cortex and visual cortex for uni-sensory processing, and superior temporal sulcus (STS) for AV integration, were identified for each person through independent functional localizer task. The results showed experience-dependent changes involving ROIs of auditory cortex, STS and functional connectivity between uni-sensory ROIs and STS from pretest to posttest in both cases. These data provide initial evidence for the malleable experience-driven cortical functionality for AV speech perception in elderly hearing-impaired people and call for further studies with a much larger subject sample and systematic control to fill in the knowledge gap to understand brain plasticity associated with auditory rehabilitation in the aging population. PMID:28270763
Kumar, G Vinodh; Halder, Tamesh; Jaiswal, Amit K; Mukherjee, Abhishek; Roy, Dipanjan; Banerjee, Arpan
Observable lip movements of the speaker influence perception of auditory speech. A classical example of this influence is reported by listeners who perceive an illusory (cross-modal) speech sound (McGurk-effect) when presented with incongruent audio-visual (AV) speech stimuli. Recent neuroimaging studies of AV speech perception accentuate the role of frontal, parietal, and the integrative brain sites in the vicinity of the superior temporal sulcus (STS) for multisensory speech perception. However, if and how does the network across the whole brain participates during multisensory perception processing remains an open question. We posit that a large-scale functional connectivity among the neural population situated in distributed brain sites may provide valuable insights involved in processing and fusing of AV speech. Varying the psychophysical parameters in tandem with electroencephalogram (EEG) recordings, we exploited the trial-by-trial perceptual variability of incongruent audio-visual (AV) speech stimuli to identify the characteristics of the large-scale cortical network that facilitates multisensory perception during synchronous and asynchronous AV speech. We evaluated the spectral landscape of EEG signals during multisensory speech perception at varying AV lags. Functional connectivity dynamics for all sensor pairs was computed using the time-frequency global coherence, the vector sum of pairwise coherence changes over time. During synchronous AV speech, we observed enhanced global gamma-band coherence and decreased alpha and beta-band coherence underlying cross-modal (illusory) perception compared to unisensory perception around a temporal window of 300-600 ms following onset of stimuli. During asynchronous speech stimuli, a global broadband coherence was observed during cross-modal perception at earlier times along with pre-stimulus decreases of lower frequency power, e.g., alpha rhythms for positive AV lags and theta rhythms for negative AV lags. Thus, our
Kumar, G. Vinodh; Halder, Tamesh; Jaiswal, Amit K.; Mukherjee, Abhishek; Roy, Dipanjan; Banerjee, Arpan
Observable lip movements of the speaker influence perception of auditory speech. A classical example of this influence is reported by listeners who perceive an illusory (cross-modal) speech sound (McGurk-effect) when presented with incongruent audio-visual (AV) speech stimuli. Recent neuroimaging studies of AV speech perception accentuate the role of frontal, parietal, and the integrative brain sites in the vicinity of the superior temporal sulcus (STS) for multisensory speech perception. However, if and how does the network across the whole brain participates during multisensory perception processing remains an open question. We posit that a large-scale functional connectivity among the neural population situated in distributed brain sites may provide valuable insights involved in processing and fusing of AV speech. Varying the psychophysical parameters in tandem with electroencephalogram (EEG) recordings, we exploited the trial-by-trial perceptual variability of incongruent audio-visual (AV) speech stimuli to identify the characteristics of the large-scale cortical network that facilitates multisensory perception during synchronous and asynchronous AV speech. We evaluated the spectral landscape of EEG signals during multisensory speech perception at varying AV lags. Functional connectivity dynamics for all sensor pairs was computed using the time-frequency global coherence, the vector sum of pairwise coherence changes over time. During synchronous AV speech, we observed enhanced global gamma-band coherence and decreased alpha and beta-band coherence underlying cross-modal (illusory) perception compared to unisensory perception around a temporal window of 300–600 ms following onset of stimuli. During asynchronous speech stimuli, a global broadband coherence was observed during cross-modal perception at earlier times along with pre-stimulus decreases of lower frequency power, e.g., alpha rhythms for positive AV lags and theta rhythms for negative AV lags. Thus
Williams, Joshua T; Darcy, Isabelle; Newman, Sharlene D
The aim of the present study was to characterize effects of learning a sign language on the processing of a spoken language. Specifically, audiovisual phoneme comprehension was assessed before and after 13 weeks of sign language exposure. L2 ASL learners performed this task in the fMRI scanner. Results indicated that L2 American Sign Language (ASL) learners' behavioral classification of the speech sounds improved with time compared to hearing nonsigners. Results indicated increased activation in the supramarginal gyrus (SMG) after sign language exposure, which suggests concomitant increased phonological processing of speech. A multiple regression analysis indicated that learner's rating on co-sign speech use and lipreading ability was correlated with SMG activation. This pattern of results indicates that the increased use of mouthing and possibly lipreading during sign language acquisition may concurrently improve audiovisual speech processing in budding hearing bimodal bilinguals.
Stevenson, Ryan A.; Siemann, Justin K.; Woynaroski, Tiffany G.; Schneider, Brittany C.; Eberly, Haley E.; Camarata, Stephen M.; Wallace, Mark T.
Atypical communicative abilities are a core marker of Autism Spectrum Disorders (ASD). A number of studies have shown that, in addition to auditory comprehension differences, individuals with autism frequently show atypical responses to audiovisual speech, suggesting a multisensory contribution to these communicative differences from their…
Stevenson, Ryan A; Nelms, Caitlin E; Baum, Sarah H; Zurkovsky, Lilia; Barense, Morgan D; Newhouse, Paul A; Wallace, Mark T
Over the next 2 decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio. For whole-word recognition, older adults relative to younger adults showed greater multisensory gains at intermediate SNRs but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as signal-to-noise ratio decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy-aging populations and that deficits begin to emerge only at the more complex word-recognition level of speech signals.
Stevenson, Ryan A.; Nelms, Caitlin; Baum, Sarah H.; Zurkovsky, Lilia; Barense, Morgan D.; Newhouse, Paul A.; Wallace, Mark T.
Over the next two decades, a dramatic shift in the demographics of society will take place, with a rapid growth in the population of older adults. One of the most common complaints with healthy aging is a decreased ability to successfully perceive speech, particularly in noisy environments. In such noisy environments, the presence of visual speech cues (i.e., lip movements) provide striking benefits for speech perception and comprehension, but previous research suggests that older adults gain less from such audiovisual integration than their younger peers. To determine at what processing level these behavioral differences arise in healthy-aging populations, we administered a speech-in-noise task to younger and older adults. We compared the perceptual benefits of having speech information available in both the auditory and visual modalities and examined both phoneme and whole-word recognition across varying levels of signal-to-noise ratio (SNR). For whole-word recognition, older relative to younger adults showed greater multisensory gains at intermediate SNRs, but reduced benefit at low SNRs. By contrast, at the phoneme level both younger and older adults showed approximately equivalent increases in multisensory gain as SNR decreased. Collectively, the results provide important insights into both the similarities and differences in how older and younger adults integrate auditory and visual speech cues in noisy environments, and help explain some of the conflicting findings in previous studies of multisensory speech perception in healthy aging. These novel findings suggest that audiovisual processing is intact at more elementary levels of speech perception in healthy aging populations, and that deficits begin to emerge only at the more complex, word-recognition level of speech signals. PMID:25282337
Loh, Marco; Schmid, Gabriele; Deco, Gustavo; Ziegler, Wolfram
Audiovisual speech perception provides an opportunity to investigate the mechanisms underlying multimodal processing. By using nonspeech stimuli, it is possible to investigate the degree to which audiovisual processing is specific to the speech domain. It has been shown in a match-to-sample design that matching across modalities is more difficult…
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
Speech perception integrates auditory and visual information. This is evidenced by the McGurk illusion where seeing the talking face influences the auditory phonetic percept and by the audiovisual detection advantage where seeing the talking face influences the detectability of the acoustic speech...... signal. Here we show that identification of phonetic content and detection can be dissociated as speech-specific and non-specific audiovisual integration effects. To this end, we employed synthetically modified stimuli, sine wave speech (SWS), which is an impoverished speech signal that only observers...... informed of its speech-like nature recognize as speech. While the McGurk illusion only occurred for informed observers the audiovisual detection advantage occurred for naïve observers as well. This finding supports a multi-stage account of audiovisual integration of speech in which the many attributes...
Eskelund, Kasper; Tuomainen, Jyrki; Andersen, Tobias
Seeing the talker’s articulatory mouth movements can influence the auditory speech percept both in speech identification and detection tasks. Here we show that these audiovisual integration effects also occur for sine wave speech (SWS), which is an impoverished speech signal that naïve observers...... often fail to perceive as speech. While audiovisual integration in the identification task only occurred when observers were informed of the speech-like nature of SWS, integration occurred in the detection task both for informed and naïve observers. This shows that both speech-specific and general...
Drullman, R; Smoorenburg, G F
For many people with profound hearing loss conventional hearing aids give only little support in speechreading. This study aims at optimizing the presentation of speech signals in the severely reduced dynamic range of the profoundly hearing impaired by means of multichannel compression and multichannel amplification. The speech signal in each of six 1-octave channels (125-4000 Hz) was compressed instantaneously, using compression ratios of 1, 2, 3, or 5, and a compression threshold of 35 dB below peak level. A total of eight conditions were composed in which the compression ratio varied per channel. Sentences were presented audio-visually to 16 profoundly hearing-impaired subjects and syllable intelligibility was measured. Results show that all auditory signals are valuable supplements to speechreading. No clear overall preference is found for any of the compression conditions, but relatively high compression ratios (> 3-5) have a significantly detrimental effect. Inspection of the individual results reveals that compression may be beneficial for one subject.
Full Text Available It is well known that simultaneous presentation of incongruent audio and visual stimuli can lead to illusory percepts. Recent data suggest that distinct processes underlie non-specific intersensory speech as opposed to non-speech perception. However, the development of both speech and non-speech intersensory perception across childhood and adolescence remains poorly defined. Thirty-eight observers aged 5 to 19 were tested on the McGurk effect (an audio-visual illusion involving speech, the Illusory Flash effect and the Fusion effect (two audio-visual illusions not involving speech to investigate the development of audio-visual interactions and contrast speech vs. non-speech developmental patterns. Whereas the strength of audio-visual speech illusions varied as a direct function of maturational level, performance on non-speech illusory tasks appeared to be homogeneous across all ages. These data support the existence of independent maturational processes underlying speech and non-speech audio-visual illusory effects.
Full Text Available Speech researchers have long been interested in how auditory and visual speech signals are integrated, and recent work has revived interest in the role of speech production with respect to this process. Here we discuss these issues from a developmental perspective. Because speech perception abilities typically outstrip speech production abilities in infancy and childhood, it is unclear how speech-like movements could influence audiovisual speech perception in development. While work on this question is still in its preliminary stages, there is nevertheless increasing evidence that sensorimotor processes (defined here as any motor or proprioceptive process related to orofacial movements affect developmental audiovisual speech processing. We suggest three areas on which to focus in future research: i the relation between audiovisual speech perception and sensorimotor processes at birth, ii the pathways through which sensorimotor processes interact with audiovisual speech processing in infancy, and iii developmental change in sensorimotor pathways as speech production emerges in childhood.
Guellaï, Bahia; Streri, Arlette; Yeung, H Henny
Speech researchers have long been interested in how auditory and visual speech signals are integrated, and the recent work has revived interest in the role of speech production with respect to this process. Here, we discuss these issues from a developmental perspective. Because speech perception abilities typically outstrip speech production abilities in infancy and childhood, it is unclear how speech-like movements could influence audiovisual speech perception in development. While work on this question is still in its preliminary stages, there is nevertheless increasing evidence that sensorimotor processes (defined here as any motor or proprioceptive process related to orofacial movements) affect developmental audiovisual speech processing. We suggest three areas on which to focus in future research: (i) the relation between audiovisual speech perception and sensorimotor processes at birth, (ii) the pathways through which sensorimotor processes interact with audiovisual speech processing in infancy, and (iii) developmental change in sensorimotor pathways as speech production emerges in childhood.
Pons, Ferran; Andreu, Llorenc; Sanz-Torrent, Monica; Buil-Legaz, Lucia; Lewkowicz, David J.
Speech perception involves the integration of auditory and visual articulatory information, and thus requires the perception of temporal synchrony between this information. There is evidence that children with specific language impairment (SLI) have difficulty with auditory speech perception but it is not known if this is also true for the…
Andersen, Tobias S; Starrfelt, Randi
Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speech perception. While these studies have focused on auditory speech perception other studies have shown that Broca's area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca's aphasia did not experience the McGurk illusion suggesting that an intact Broca's area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical, which could be due to Broca's area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke's aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing that are not specific to Broca's aphasia.
. Dependent on the training data, these other units may also be contextually immediate neighboring units. The poster demonstrates the idea with text material spoken by one individual subject using a set of simple audio-visual features. The data material for the training process consists of 44 labeled...... visual lip features is used. Phoneme-related receptive fields result on the SOM basis; they are speaker dependent and show individual locations and strain. Overlapping main slopes indicate a high similarity of respective units; distortion or extra peaks originate from the influence of other units...... sentences in German with a balanced phoneme repertoire. As a result it can be stated that (i) the SOM can be trained to map auditory and visual features in a topology-preserving way and (ii) they show strain due to the influence of other audio-visual units. The SOM can be used to measure similarity amongst...
Maier, Joost X.; Di Luca, Massimiliano; Noppeney, Uta
Combining information from the visual and auditory senses can greatly enhance intelligibility of natural speech. Integration of audiovisual speech signals is robust even when temporal offsets are present between the component signals. In the present study, we characterized the temporal integration window for speech and nonspeech stimuli with…
Megnin-Viggars, Odette; Goswami, Usha
Visual speech inputs can enhance auditory speech information, particularly in noisy or degraded conditions. The natural statistics of audiovisual speech highlight the temporal correspondence between visual and auditory prosody, with lip, jaw, cheek and head movements conveying information about the speech envelope. Low-frequency spatial and…
Barrós-Loscertales, Alfonso; Ventura-Campos, Noelia; Visser, Maya; Alsius, Agnès; Pallier, Christophe; Avila Rivera, César; Soto-Faraco, Salvador
Neuroimaging studies of audiovisual speech processing have exclusively addressed listeners' native language (L1). Yet, several behavioural studies now show that AV processing plays an important role in non-native (L2) speech perception. The current fMRI study measured brain activity during auditory, visual, audiovisual congruent and audiovisual incongruent utterances in L1 and L2. BOLD responses to congruent AV speech in the pSTS were stronger than in either unimodal condition in both L1 and L2. Yet no differences in AV processing were expressed according to the language background in this area. Instead, the regions in the bilateral occipital lobe had a stronger congruency effect on the BOLD response (congruent higher than incongruent) in L2 as compared to L1. According to these results, language background differences are predominantly expressed in these unimodal regions, whereas the pSTS is similarly involved in AV integration regardless of language dominance.
Full Text Available Although infant speech perception in often studied in isolated modalities, infants' experience with speech is largely multimodal (i.e., speech sounds they hear are accompanied by articulating faces. Across two experiments, we tested infants' sensitivity to the relationship between the auditory and visual components of audiovisual speech in their native (English and non-native (Spanish language. In Experiment 1, infants' looking times were measured during a preferential looking task in which they saw two simultaneous visual speech streams articulating a story, one in English and the other in Spanish, while they heard either the English or the Spanish version of the story. In Experiment 2, looking times from another group of infants were measured as they watched single displays of congruent and incongruent combinations of English and Spanish audio and visual speech streams. Findings demonstrated an age-related increase in looking towards the native relative to non-native visual speech stream when accompanied by the corresponding (native auditory speech. This increase in native language preference did not appear to be driven by a difference in preference for native vs. non-native audiovisual congruence as we observed no difference in looking times at the audiovisual streams in Experiment 2.
Traditionally, second language (L2) instruction has emphasised auditory-based instruction methods. However, this approach is restrictive in the sense that speech perception by humans is not just an auditory phenomenon but a multimodal one, and specifically, a visual one as well. In the past decade, experimental studies have shown that the…
Klitsch, Julia Ulrike
This dissertation investigates speech perception in three different groups of native adult speakers of Dutch; an aphasic and two age-varying control groups. By means of two different experiments it is examined if the availability of visual articulatory information is beneficial to the auditory speec
Andersen, Tobias; Starrfelt, Randi
Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speech...... perception. While these studies have focused on auditory speech perception other studies have shown that Broca's area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca's aphasia did not experience the McGurk illusion suggesting that an intact Broca......'s area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical...
Kaganovich, Natalya; Schumaker, Jennifer
Previous studies have demonstrated that the presence of visual speech cues reduces the amplitude and latency of the N1 and P2 event-related potential (ERP) components elicited by speech stimuli. However, the developmental trajectory of this effect is not yet fully mapped. We examined ERP responses to auditory, visual, and audiovisual speech in two groups of school-age children (7-8-year-olds and 10-11-year-olds) and in adults. Audiovisual speech led to the attenuation of the N1 and P2 components in all groups of participants, suggesting that the neural mechanisms underlying these effects are functional by early school years. Additionally, while the reduction in N1 was largest over the right scalp, the P2 attenuation was largest over the left and midline scalp. The difference in the hemispheric distribution of the N1 and P2 attenuation supports the idea that these components index at least somewhat disparate neural processes within the context of audiovisual speech perception.
Van der Burg, Erik; Goodbourn, Patrick T
The brain is adaptive. The speed of propagation through air, and of low-level sensory processing, differs markedly between auditory and visual stimuli; yet the brain can adapt to compensate for the resulting cross-modal delays. Studies investigating temporal recalibration to audiovisual speech have used prolonged adaptation procedures, suggesting that adaptation is sluggish. Here, we show that adaptation to asynchronous audiovisual speech occurs rapidly. Participants viewed a brief clip of an actor pronouncing a single syllable. The voice was either advanced or delayed relative to the corresponding lip movements, and participants were asked to make a synchrony judgement. Although we did not use an explicit adaptation procedure, we demonstrate rapid recalibration based on a single audiovisual event. We find that the point of subjective simultaneity on each trial is highly contingent upon the modality order of the preceding trial. We find compelling evidence that rapid recalibration generalizes across different stimuli, and different actors. Finally, we demonstrate that rapid recalibration occurs even when auditory and visual events clearly belong to different actors. These results suggest that rapid temporal recalibration to audiovisual speech is primarily mediated by basic temporal factors, rather than higher-order factors such as perceived simultaneity and source identity.
Smith, Elizabeth G.; Bennetto, Loisa
Background: During speech perception, the ability to integrate auditory and visual information causes speech to sound louder and be more intelligible, and leads to quicker processing. This integration is important in early language development, and also continues to affect speech comprehension throughout the lifespan. Previous research shows that…
Shinozaki, Jun; Hiroe, Nobuo; Sato, Masa-aki; Nagamine, Takashi; Sekiyama, Kaoru
Visual information about lip and facial movements plays a role in audiovisual (AV) speech perception. Although this has been widely confirmed, previous behavioural studies have shown interlanguage differences, that is, native Japanese speakers do not integrate auditory and visual speech as closely as native English speakers. To elucidate the neural basis of such interlanguage differences, 22 native English speakers and 24 native Japanese speakers were examined in behavioural or functional Magnetic Resonance Imaging (fMRI) experiments while mono-syllabic speech was presented under AV, auditory-only, or visual-only conditions for speech identification. Behavioural results indicated that the English speakers identified visual speech more quickly than the Japanese speakers, and that the temporal facilitation effect of congruent visual speech was significant in the English speakers but not in the Japanese speakers. Using fMRI data, we examined the functional connectivity among brain regions important for auditory-visual interplay. The results indicated that the English speakers had significantly stronger connectivity between the visual motion area MT and the Heschl’s gyrus compared with the Japanese speakers, which may subserve lower-level visual influences on speech perception in English speakers in a multisensory environment. These results suggested that linguistic experience strongly affects neural connectivity involved in AV speech integration. PMID:27510407
Shinozaki, Jun; Hiroe, Nobuo; Sato, Masa-Aki; Nagamine, Takashi; Sekiyama, Kaoru
Visual information about lip and facial movements plays a role in audiovisual (AV) speech perception. Although this has been widely confirmed, previous behavioural studies have shown interlanguage differences, that is, native Japanese speakers do not integrate auditory and visual speech as closely as native English speakers. To elucidate the neural basis of such interlanguage differences, 22 native English speakers and 24 native Japanese speakers were examined in behavioural or functional Magnetic Resonance Imaging (fMRI) experiments while mono-syllabic speech was presented under AV, auditory-only, or visual-only conditions for speech identification. Behavioural results indicated that the English speakers identified visual speech more quickly than the Japanese speakers, and that the temporal facilitation effect of congruent visual speech was significant in the English speakers but not in the Japanese speakers. Using fMRI data, we examined the functional connectivity among brain regions important for auditory-visual interplay. The results indicated that the English speakers had significantly stronger connectivity between the visual motion area MT and the Heschl's gyrus compared with the Japanese speakers, which may subserve lower-level visual influences on speech perception in English speakers in a multisensory environment. These results suggested that linguistic experience strongly affects neural connectivity involved in AV speech integration.
Martin, R R; Haroldson, S K
Unsophisticated raters, using 9-point interval scales, judged speech naturalness and stuttering severity of recorded stutterer and nonstutterer speech samples. Raters judged separately the audio-only and audiovisual presentations of each sample. For speech naturalness judgments of stutterer samples, raters invariably judged the audiovisual presentation more unnatural than the audio presentation of the same sample; but for the nonstutterer samples, there was no difference between audio and audiovisual naturalness ratings. Stuttering severity ratings did not differ significantly between audio and audiovisual presentations of the same samples. Rater reliability, interrater agreement, and intrarater agreement for speech naturalness judgments were assessed.
Tobias Søren Andersen
Full Text Available Lesions to Broca’s area cause aphasia characterised by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca’s area is also involved in speech perception. While these studies have focused on auditory speech perception other studies have shown that Broca’s area is activated by visual speech perception. Furthermore, one preliminary report found that a patient with Broca’s aphasia did not experience the McGurk illusion suggesting that an intact Broca’s area is necessary for audiovisual integration of speech. Here we describe a patient with Broca’s aphasia who experienced the McGurk illusion. This indicates that an intact Broca’s area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical, which could be due to Broca’s area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke’s aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing that are not specific to Broca’s aphasia.
Altieri, Nicholas; Wenger, Michael J
Speech perception engages both auditory and visual modalities. Limitations of traditional accuracy-only approaches in the investigation of audiovisual speech perception have motivated the use of new methodologies. In an audiovisual speech identification task, we utilized capacity (Townsend and Nozawa, 1995), a dynamic measure of efficiency, to quantify audiovisual integration. Capacity was used to compare RT distributions from audiovisual trials to RT distributions from auditory-only and visual-only trials across three listening conditions: clear auditory signal, S/N ratio of -12 dB, and S/N ratio of -18 dB. The purpose was to obtain EEG recordings in conjunction with capacity to investigate how a late ERP co-varies with integration efficiency. Results showed efficient audiovisual integration for low auditory S/N ratios, but inefficient audiovisual integration when the auditory signal was clear. The ERP analyses showed evidence for greater audiovisual amplitude compared to the unisensory signals for lower auditory S/N ratios (higher capacity/efficiency) compared to the high S/N ratio (low capacity/inefficient integration). The data are consistent with an interactive framework of integration, where auditory recognition is influenced by speech-reading as a function of signal clarity.
Full Text Available Speech perception engages both auditory and visual modalities. Limitations of traditional accuracy-only approaches in the investigation of audiovisual speech perception have motivated the use of new methodologies. In an audiovisual speech identification task, we utilized capacity (Townsend & Nozawa, 1995, a dynamic measure of efficiency, to quantify audiovisual integration. Capacity was used to compare RT distributions from audiovisual trials to RT distributions from auditory-only and visual-only trials across three listening conditions: clear auditory signal, S/N ratio of -12 dB, and S/N ratio of -18 dB. The purpose was to obtain EEG recordings in conjunction with capacity to investigate how a late ERP co-varies with integration efficiency. Results showed efficient audiovisual integration for low auditory S/N ratios, but inefficient audiovisual integration when the auditory signal was clear. The ERP analyses showed evidence for greater audiovisual amplitude in lower auditory S/N ratios (higher capacity/efficiency compared to the high S/N ratio (low capacity/inefficient integration. The data are consistent with an interactive framework of integration, where auditory recognition is influenced by speech-reading as a function of signal clarity.
Frisch, Stefan A.; Nikjeh, Dee Adams
A preliminary audiovisual database of English speech sounds has been developed for teaching purposes. This database contains all Standard English speech sounds produced in isolated words in word initial, word medial, and word final position, unless not allowed by English phonotactics. There is one example of each word spoken by a male and a female talker. The database consists of an audio recording, video of the face from a 45 deg angle off of center, and ultrasound video of the tongue in the mid-saggital plane. The files contained in the database are suitable for examination by the Wavesurfer freeware program in audio or video modes [Sjolander and Beskow, KTH Stockholm]. This database is intended as a multimedia reference for students in phonetics or speech science. A demonstration and plans for further development will be presented.
Lüttke, Claudia S; Ekman, Matthias; van Gerven, Marcel A J; de Lange, Floris P
Auditory speech perception can be altered by concurrent visual information. The superior temporal cortex is an important combining site for this integration process. This area was previously found to be sensitive to audiovisual congruency. However, the direction of this congruency effect (i.e., stronger or weaker activity for congruent compared to incongruent stimulation) has been more equivocal. Here, we used fMRI to look at the neural responses of human participants during the McGurk illusion--in which auditory /aba/ and visual /aga/ inputs are fused to perceived /ada/--in a large homogenous sample of participants who consistently experienced this illusion. This enabled us to compare the neuronal responses during congruent audiovisual stimulation with incongruent audiovisual stimulation leading to the McGurk illusion while avoiding the possible confounding factor of sensory surprise that can occur when McGurk stimuli are only occasionally perceived. We found larger activity for congruent audiovisual stimuli than for incongruent (McGurk) stimuli in bilateral superior temporal cortex, extending into the primary auditory cortex. This finding suggests that superior temporal cortex prefers when auditory and visual input support the same representation.
Lee, HweeLing; Noppeney, Uta
Face-to-face communication challenges the human brain to integrate information from auditory and visual senses with linguistic representations. Yet the role of bottom-up physical (spectrotemporal structure) input and top-down linguistic constraints in shaping the neural mechanisms specialized for integrating audiovisual speech signals are currently unknown. Participants were presented with speech and sinewave speech analogs in visual, auditory, and audiovisual modalities. Before the fMRI study, they were trained to perceive physically identical sinewave speech analogs as speech (SWS-S) or nonspeech (SWS-N). Comparing audiovisual integration (interactions) of speech, SWS-S, and SWS-N revealed a posterior-anterior processing gradient within the left superior temporal sulcus/gyrus (STS/STG): Bilateral posterior STS/STG integrated audiovisual inputs regardless of spectrotemporal structure or speech percept; in left mid-STS, the integration profile was primarily determined by the spectrotemporal structure of the signals; more anterior STS regions discarded spectrotemporal structure and integrated audiovisual signals constrained by stimulus intelligibility and the availability of linguistic representations. In addition to this "ventral" processing stream, a "dorsal" circuitry encompassing posterior STS/STG and left inferior frontal gyrus differentially integrated audiovisual speech and SWS signals. Indeed, dynamic causal modeling and Bayesian model comparison provided strong evidence for a parallel processing structure encompassing a ventral and a dorsal stream with speech intelligibility training enhancing the connectivity between posterior and anterior STS/STG. In conclusion, audiovisual speech comprehension emerges in an interactive process with the integration of auditory and visual signals being progressively constrained by stimulus intelligibility along the STS and spectrotemporal structure in a dorsal fronto-temporal circuitry.
Baart, Martijn; Stekelenburg, Jeroen J; Vroomen, Jean
Lip-read speech is integrated with heard speech at various neural levels. Here, we investigated the extent to which lip-read induced modulations of the auditory N1 and P2 (measured with EEG) are indicative of speech-specific audiovisual integration, and we explored to what extent the ERPs were modulated by phonetic audiovisual congruency. In order to disentangle speech-specific (phonetic) integration from non-speech integration, we used Sine-Wave Speech (SWS) that was perceived as speech by half of the participants (they were in speech-mode), while the other half was in non-speech mode. Results showed that the N1 obtained with audiovisual stimuli peaked earlier than the N1 evoked by auditory-only stimuli. This lip-read induced speeding up of the N1 occurred for listeners in speech and non-speech mode. In contrast, if listeners were in speech-mode, lip-read speech also modulated the auditory P2, but not if listeners were in non-speech mode, thus revealing speech-specific audiovisual binding. Comparing ERPs for phonetically congruent audiovisual stimuli with ERPs for incongruent stimuli revealed an effect of phonetic stimulus congruency that started at ~200 ms after (in)congruence became apparent. Critically, akin to the P2 suppression, congruency effects were only observed if listeners were in speech mode, and not if they were in non-speech mode. Using identical stimuli, we thus confirm that audiovisual binding involves (partially) different neural mechanisms for sound processing in speech and non-speech mode.
Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann
During speech communication, visual information may interact with the auditory system at various processing stages. Most noteworthy, recent magnetoencephalography (MEG) data provided first evidence for early and preattentive phonetic/phonological encoding of the visual data stream--prior to its fusion with auditory phonological features [Hertrich,…
Woynaroski, Tiffany G.; Kwakye, Leslie D.; Foss-Feig, Jennifer H.; Stevenson, Ryan A.; Stone, Wendy L.; Wallace, Mark T.
This study examined unisensory and multisensory speech perception in 8-17 year old children with autism spectrum disorders (ASD) and typically developing controls matched on chronological age, sex, and IQ. Consonant-vowel syllables were presented in visual only, auditory only, matched audiovisual, and mismatched audiovisual ("McGurk")…
Jeanne A Guiraud
Full Text Available The language difficulties often seen in individuals with autism might stem from an inability to integrate audiovisual information, a skill important for language development. We investigated whether 9-month-old siblings of older children with autism, who are at an increased risk of developing autism, are able to integrate audiovisual speech cues. We used an eye-tracker to record where infants looked when shown a screen displaying two faces of the same model, where one face is articulating/ba/and the other/ga/, with one face congruent with the syllable sound being presented simultaneously, the other face incongruent. This method was successful in showing that infants at low risk can integrate audiovisual speech: they looked for the same amount of time at the mouths in both the fusible visual/ga/- audio/ba/and the congruent visual/ba/- audio/ba/displays, indicating that the auditory and visual streams fuse into a McGurk-type of syllabic percept in the incongruent condition. It also showed that low-risk infants could perceive a mismatch between auditory and visual cues: they looked longer at the mouth in the mismatched, non-fusible visual/ba/- audio/ga/display compared with the congruent visual/ga/- audio/ga/display, demonstrating that they perceive an uncommon, and therefore interesting, speech-like percept when looking at the incongruent mouth (repeated ANOVA: displays x fusion/mismatch conditions interaction: F(1,16 = 17.153, p = 0.001. The looking behaviour of high-risk infants did not differ according to the type of display, suggesting difficulties in matching auditory and visual information (repeated ANOVA, displays x conditions interaction: F(1,25 = 0.09, p = 0.767, in contrast to low-risk infants (repeated ANOVA: displays x conditions x low/high-risk groups interaction: F(1,41 = 4.466, p = 0.041. In some cases this reduced ability might lead to the poor communication skills characteristic of autism.
Roa Romero, Yadira; Senkowski, Daniel; Keil, Julian
The McGurk illusion is a prominent example of audiovisual speech perception and the influence that visual stimuli can have on auditory perception. In this illusion, a visual speech stimulus influences the perception of an incongruent auditory stimulus, resulting in a fused novel percept. In this high-density electroencephalography (EEG) study, we were interested in the neural signatures of the subjective percept of the McGurk illusion as a phenomenon of speech-specific multisensory integration. Therefore, we examined the role of cortical oscillations and event-related responses in the perception of congruent and incongruent audiovisual speech. We compared the cortical activity elicited by objectively congruent syllables with incongruent audiovisual stimuli. Importantly, the latter elicited a subjectively congruent percept: the McGurk illusion. We found that early event-related responses (N1) to audiovisual stimuli were reduced during the perception of the McGurk illusion compared with congruent stimuli. Most interestingly, our study showed a stronger poststimulus suppression of beta-band power (13-30 Hz) at short (0-500 ms) and long (500-800 ms) latencies during the perception of the McGurk illusion compared with congruent stimuli. Our study demonstrates that auditory perception is influenced by visual context and that the subsequent formation of a McGurk illusion requires stronger audiovisual integration even at early processing stages. Our results provide evidence that beta-band suppression at early stages reflects stronger stimulus processing in the McGurk illusion. Moreover, stronger late beta-band suppression in McGurk illusion indicates the resolution of incongruent physical audiovisual input and the formation of a coherent, illusory multisensory percept.
Dick, Anthony Steven; Solodkin, Ana; Small, Steven L.
Everyday conversation is both an auditory and a visual phenomenon. While visual speech information enhances comprehension for the listener, evidence suggests that the ability to benefit from this information improves with development. A number of brain regions have been implicated in audiovisual speech comprehension, but the extent to which the…
Altieri, Nicholas; Townsend, James T; Wenger, Michael J
We propose a measure of audiovisual speech integration that takes into account accuracy and response times. This measure should prove beneficial for researchers investigating multisensory speech recognition, since it relates to normal-hearing and aging populations. As an example, age-related sensory decline influences both the rate at which one processes information and the ability to utilize cues from different sensory modalities. Our function assesses integration when both auditory and visual information are available, by comparing performance on these audiovisual trials with theoretical predictions for performance under the assumptions of parallel, independent self-terminating processing of single-modality inputs. We provide example data from an audiovisual identification experiment and discuss applications for measuring audiovisual integration skills across the life span.
Altieri, Nicholas; Hudock, Daniel
Research in audiovisual speech perception has demonstrated that sensory factors such as auditory and visual acuity are associated with a listener's ability to extract and combine auditory and visual speech cues. This case study report examined audiovisual integration using a newly developed measure of capacity in a sample of hearing-impaired listeners. Capacity assessments are unique because they examine the contribution of reaction-time (RT) as well as accuracy to determine the extent to which a listener efficiently combines auditory and visual speech cues relative to independent race model predictions. Multisensory speech integration ability was examined in two experiments: an open-set sentence recognition and a closed set speeded-word recognition study that measured capacity. Most germane to our approach, capacity illustrated speed-accuracy tradeoffs that may be predicted by audiometric configuration. Results revealed that some listeners benefit from increased accuracy, but fail to benefit in terms of speed on audiovisual relative to unisensory trials. Conversely, other listeners may not benefit in the accuracy domain but instead show an audiovisual processing time benefit.
Full Text Available Audiovisual text-to-speech systems convert a written text into an audiovisual speech signal. Typically, the visual mode of the synthetic speech is synthesized separately from the audio, the latter being either natural or synthesized speech. However, the perception of mismatches between these two information streams requires experimental exploration since it could degrade the quality of the output. In order to increase the intermodal coherence in synthetic 2D photorealistic speech, we extended the well-known unit selection audio synthesis technique to work with multimodal segments containing original combinations of audio and video. Subjective experiments confirm that the audiovisual signals created by our multimodal synthesis strategy are indeed perceived as being more synchronous than those of systems in which both modes are not intrinsically coherent. Furthermore, it is shown that the degree of coherence between the auditory mode and the visual mode has an influence on the perceived quality of the synthetic visual speech fragment. In addition, the audio quality was found to have only a minor influence on the perceived visual signal's quality.
Bushmakin, Maxim; Kim, Sunah; Wallace, Mark T.; Puce, Aina; James, Thomas W.
In recent years, it has become evident that neural responses previously considered to be unisensory can be modulated by sensory input from other modalities. In this regard, visual neural activity elicited to viewing a face is strongly influenced by concurrent incoming auditory information, particularly speech. Here, we applied an additive-factors paradigm aimed at quantifying the impact that auditory speech has on visual event-related potentials (ERPs) elicited to visual speech. These multisensory interactions were measured across parametrically varied stimulus salience, quantified in terms of signal to noise, to provide novel insights into the neural mechanisms of audiovisual speech perception. First, we measured a monotonic increase of the amplitude of the visual P1-N1-P2 ERP complex during a spoken-word recognition task with increases in stimulus salience. ERP component amplitudes varied directly with stimulus salience for visual, audiovisual, and summed unisensory recordings. Second, we measured changes in multisensory gain across salience levels. During audiovisual speech, the P1 and P1-N1 components exhibited less multisensory gain relative to the summed unisensory components with reduced salience, while N1-P2 amplitude exhibited greater multisensory gain as salience was reduced, consistent with the principle of inverse effectiveness. The amplitude interactions were correlated with behavioral measures of multisensory gain across salience levels as measured by response times, suggesting that change in multisensory gain associated with unisensory salience modulations reflects an increased efficiency of visual speech processing. PMID:22367585
Lebib, Riadh; Papo, David; Douiri, Abdel; de Bode, Stella; Gillon Dowens, Margaret; Baudonnière, Pierre-Marie
Lipreading reliably improve speech perception during face-to-face conversation. Within the range of good dubbing, however, adults tolerate some audiovisual (AV) discrepancies and lipreading, then, can give rise to confusion. We used event-related brain potentials (ERPs) to study the perceptual strategies governing the intermodal processing of dynamic and bimodal speech stimuli, either congruently dubbed or not. Electrophysiological analyses revealed that non-coherent audiovisual dubbings modulated in amplitude an endogenous ERP component, the N300, we compared to a 'N400-like effect' reflecting the difficulty to integrate these conflicting pieces of information. This result adds further support for the existence of a cerebral system underlying 'integrative processes' lato sensu. Further studies should take advantage of this 'N400-like effect' with AV speech stimuli to open new perspectives in the domain of psycholinguistics.
Full Text Available The use of visual features in audio-visual speech recognition (AVSR is justified by both the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM and the factorial HMM (FHMM, and compare the performance of these models with the existing models used in speaker dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and FHMM allow to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.
Kaganovich, Natalya; Schumaker, Jennifer; Macias, Danielle; Gustafson, Dana
Previous studies indicate that at least some aspects of audiovisual speech perception are impaired in children with specific language impairment (SLI). However, whether audiovisual processing difficulties are also present in older children with a history of this disorder is unknown. By combining electrophysiological and behavioral measures, we…
Full Text Available Humans rely on multiple sensory modalities to determine the emotional state of others. In fact, such multisensory perception may be one of the mechanisms explaining the ease and efficiency by which others’ emotions are recognized. But how and when exactly do the different modalities interact? One aspect in multisensory perception that has received increasing interest in recent years is the concept of crossmodal prediction. In emotion perception, as in most other settings, visual information precedes the auditory one. Thereby, leading in visual information can facilitate subsequent auditory processing. While this mechanism has often been described in audiovisual speech perception, it has not been addressed so far in audiovisual emotion perception. Based on the current state of the art in (a crossmodal prediction and (b multisensory emotion perception research, we propose that it is essential to consider the former in order to fully understand the latter. Focusing on electroencephalographic (EEG and magnetoencephalographic (MEG studies, we provide a brief overview of the current research in both fields. In discussing these findings, we suggest that emotional visual information may allow for a more reliable prediction of auditory information compared to non-emotional visual information. In support of this hypothesis, we present a re-analysis of a previous data set that shows an inverse correlation between the N1 response in the EEG and the duration of visual emotional but not non-emotional information. If the assumption that emotional content allows for more reliable predictions can be corroborated in future studies, crossmodal prediction is a crucial factor in our understanding of multisensory emotion perception.
Schwartz, Jean-Luc; Savariaux, Christophe
An increasing number of neuroscience papers capitalize on the assumption published in this journal that visual speech would be typically 150 ms ahead of auditory speech. It happens that the estimation of audiovisual asynchrony in the reference paper is valid only in very specific cases, for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call "preparatory gestures". However, when syllables are chained in sequences, as they are typically in most parts of a natural speech utterance, asynchrony should be defined in a different way. This is what we call "comodulatory gestures" providing auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na) showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
Full Text Available An increasing number of neuroscience papers capitalize on the assumption published in this journal that visual speech would be typically 150 ms ahead of auditory speech. It happens that the estimation of audiovisual asynchrony in the reference paper is valid only in very specific cases, for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call "preparatory gestures". However, when syllables are chained in sequences, as they are typically in most parts of a natural speech utterance, asynchrony should be defined in a different way. This is what we call "comodulatory gestures" providing auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
Lee, Hweeling; Noppeney, Uta
This psychophysics study used musicians as a model to investigate whether musical expertise shapes the temporal integration window for audiovisual speech, sinewave speech, or music. Musicians and non-musicians judged the audiovisual synchrony of speech, sinewave analogs of speech, and music stimuli at 13 audiovisual stimulus onset asynchronies (±360, ±300 ±240, ±180, ±120, ±60, and 0 ms). Further, we manipulated the duration of the stimuli by presenting sentences/melodies or syllables/tones. Critically, musicians relative to non-musicians exhibited significantly narrower temporal integration windows for both music and sinewave speech. Further, the temporal integration window for music decreased with the amount of music practice, but not with age of acquisition. In other words, the more musicians practiced piano in the past 3 years, the more sensitive they became to the temporal misalignment of visual and auditory signals. Collectively, our findings demonstrate that music practicing fine-tunes the audiovisual temporal integration window to various extents depending on the stimulus class. While the effect of piano practicing was most pronounced for music, it also generalized to other stimulus classes such as sinewave speech and to a marginally significant degree to natural speech.
Eskelund, Kasper; MacDonald, Ewen; Andersen, Tobias
as demonstrated by the Thatcher illusion in which the orientation of the eyes and mouth with respect to the face is inverted (Thatcherization). This gives the face a grotesque appearance but this is only seen when the face is upright. Thatcherization can likewise disrupt visual speech perception but only when...... the face is upright indicating that facial configuration can be important for visual speech perception. This effect can propagate to auditory speech perception through audiovisual integration so that Thatcherization disrupts the McGurk illusion in which visual speech perception alters perception...... perception due to the McGurk illusion without any change in the acoustic stimulus. We found that Thatcherization disrupted a strong McGurk illusion and a correspondingly strong McGurk-MMN only for upright faces. This confirms that facial configuration can be important for audiovisual speech perception...
Bicevskis, Katie; Derrick, Donald; Gick, Bryan
Audio-visual [McGurk and MacDonald (1976). Nature 264, 746-748] and audio-tactile [Gick and Derrick (2009). Nature 462(7272), 502-504] speech stimuli enhance speech perception over audio stimuli alone. In addition, multimodal speech stimuli form an asymmetric window of integration that is consistent with the relative speeds of the various signals [Munhall, Gribble, Sacco, and Ward (1996). Percept. Psychophys. 58(3), 351-362; Gick, Ikegami, and Derrick (2010). J. Acoust. Soc. Am. 128(5), EL342-EL346]. In this experiment, participants were presented video of faces producing /pa/ and /ba/ syllables, both alone and with air puffs occurring synchronously and at different timings up to 300 ms before and after the stop release. Perceivers were asked to identify the syllable they perceived, and were more likely to respond that they perceived /pa/ when air puffs were present, with asymmetrical preference for puffs following the video signal-consistent with the relative speeds of visual and air puff signals. The results demonstrate that visual-tactile integration of speech perception occurs much as it does with audio-visual and audio-tactile stimuli. This finding contributes to the understanding of multimodal speech perception, lending support to the idea that speech is not perceived as an audio signal that is supplemented by information from other modes, but rather that primitives of speech perception are, in principle, modality neutral.
Andersen, Tobias; Starrfelt, Randi
's area is necessary for audiovisual integration of speech. Here we describe a patient with Broca's aphasia who experienced the McGurk illusion. This indicates that an intact Broca's area is not necessary for audiovisual integration of speech. The McGurk illusions this patient experienced were atypical......, which could be due to Broca's area having a more subtle role in audiovisual integration of speech. The McGurk illusions of a control subject with Wernicke's aphasia were, however, also atypical. This indicates that the atypical McGurk illusions were due to deficits in speech processing...
We present a modular framework for articulatory animation synthesis using speech motion capture data obtained with electromagnetic articulography (EMA). Adapting a skeletal animation approach, the articulatory motion data is applied to a three-dimensional (3D) model of the vocal tract, creating a portable resource that can be integrated in an audiovisual (AV) speech synthesis platform to provide realistic animation of the tongue and teeth for a virtual character. The framework also provides an interface to articulatory animation synthesis, as well as an example application to illustrate its use with a 3D game engine. We rely on cross-platform, open-source software and open standards to provide a lightweight, accessible, and portable workflow.
Full Text Available Events encoded in separate sensory modalities, such as audition and vision, can seem to be synchronous across a relatively broad range of physical timing differences. This may suggest that the precision of audio-visual timing judgments is inherently poor. Here we show that this is not necessarily true. We contrast timing sensitivity for isolated streams of audio and visual speech, and for streams of audio and visual speech accompanied by additional, temporally offset, visual speech streams. We find that the precision with which synchronous streams of audio and visual speech are identified is enhanced by the presence of additional streams of asynchronous visual speech. Our data suggest that timing perception is shaped by selective grouping processes, which can result in enhanced precision in temporally cluttered environments. The imprecision suggested by previous studies might therefore be a consequence of examining isolated pairs of audio and visual events. We argue that when an isolated pair of cross-modal events is presented, they tend to group perceptually and to seem synchronous as a consequence. We have revealed greater precision by providing multiple visual signals, possibly allowing a single auditory speech stream to group selectively with the most synchronous visual candidate. The grouping processes we have identified might be important in daily life, such as when we attempt to follow a conversation in a crowded room.
McGrath, M; Summerfield, Q
Audio-visual identification of sentences was measured as a function of audio delay in untrained observers with normal hearing; the soundtrack was replaced by rectangular pulses originally synchronized to the closing of the talker's vocal folds and then subjected to delay. When the soundtrack was delayed by 160 ms, identification scores were no better than when no acoustical information at all was provided. Delays of up to 80 ms had little effect on group-mean performance, but a separate analysis of a subgroup of better lipreaders showed a significant trend of reduced scores with increased delay in the range from 0-80 ms. A second experiment tested the interpretation that, although the main disruptive effect of the delay occurred on a syllabic time scale, better lipreaders might be attempting to use intermodal timing cues at a phonemic level. Normal-hearing observers determined whether a 120-Hz complex tone started before or after the opening of a pair of liplike Lissajou figures. Group-mean difference limens (70.7% correct DLs) were - 79 ms (sound leading) and + 138 ms (sound lagging), with no significant correlation between DLs and sentence lipreading scores. It was concluded that most observers, whether good lipreaders or not, possess insufficient sensitivity to intermodal timing cues in audio-visual speech for them to be used analogously to voice onset time in auditory speech perception. The results of both experiments imply that delays of up to about 40 ms introduced by signal-processing algorithms in aids to lipreading should not materially affect audio-visual speech understanding.
Wojdel, J.; Wiggers, P.; Rothkrantz, L.J.M.
This paper describes the gathering and availability of an audio-visual speech corpus for Dutch language. The corpus was prepared with the multi-modal speech recognition in mind and it is currently used in our research on lip-reading and bimodal speech recognition. It contains the prompts used also i
Kushnerenko, Elena; Tomalski, Przemyslaw; Ballieux, Haiko; Ribeiro, Helena; Potton, Anita; Axelsson, Emma L; Murphy, Elizabeth; Moore, Derek G
Research on audiovisual speech integration has reported high levels of individual variability, especially among young infants. In the present study we tested the hypothesis that this variability results from individual differences in the maturation of audiovisual speech processing during infancy. A developmental shift in selective attention to audiovisual speech has been demonstrated between 6 and 9 months with an increase in the time spent looking to articulating mouths as compared to eyes (Lewkowicz & Hansen-Tift. (2012) Proc. Natl Acad. Sci. USA, 109, 1431-1436; Tomalski et al. (2012) Eur. J. Dev. Psychol., 1-14). In the present study we tested whether these changes in behavioural maturational level are associated with differences in brain responses to audiovisual speech across this age range. We measured high-density event-related potentials (ERPs) in response to videos of audiovisually matching and mismatched syllables /ba/ and /ga/, and subsequently examined visual scanning of the same stimuli with eye-tracking. There were no clear age-specific changes in ERPs, but the amplitude of audiovisual mismatch response (AVMMR) to the combination of visual /ba/ and auditory /ga/ was strongly negatively associated with looking time to the mouth in the same condition. These results have significant implications for our understanding of individual differences in neural signatures of audiovisual speech processing in infants, suggesting that they are not strictly related to chronological age but instead associated with the maturation of looking behaviour, and develop at individual rates in the second half of the first year of life.
A. L. Oleinik
Full Text Available Subject of Research. The paper deals with the problem of lip region image reconstruction from speech signal by means of Partial Least Squares regression. Such problems arise in connection with development of audio-visual speech processing methods. Audio-visual speech consists of acoustic and visual components (called modalities. Applications of audio-visual speech processing methods include joint modeling of voice and lips’ movement dynamics, synchronization of audio and video streams, emotion recognition, liveness detection. Method. Partial Least Squares regression was applied to solve the posed problem. This method extracts components of initial data with high covariance. These components are used to build regression model. Advantage of this approach lies in the possibility of achieving two goals: identification of latent interrelations between initial data components (e.g. speech signal and lip region image and approximation of initial data component as a function of another one. Main Results. Experimental research on reconstruction of lip region images from speech signal was carried out on VidTIMIT audio-visual speech database. Results of the experiment showed that Partial Least Squares regression is capable of solving reconstruction problem. Practical Significance. Obtained findings give the possibility to assert that Partial Least Squares regression is successfully applicable for solution of vast variety of audio-visual speech processing problems: from synchronization of audio and video streams to liveness detection.
Saltuklaroglu, Tim; Kalinowski, Joseph; Dayalu, Vikram N; Stuart, Andrew; Rastatter, Michael P
In accord with a proposed innate link between speech perception and production (e.g., motor theory), this study provides compelling evidence for the inhibition of stuttering events in people who stutter prior to the initiation of the intended speech act, via both the perception and the production of speech gestures. Stuttering frequency during reading was reduced in 10 adults who stutter by approximately 40% in three of four experimental conditions: (1) following passive audiovisual presentation (i.e., viewing and hearing) of another person producing pseudostuttering (stutter-like syllabic repetitions) and following active shadowing of both (2) pseudostuttered and (3) fluent speech. Stuttering was not inhibited during reading following passive audiovisual presentation of fluent speech. Syllabic repetitions can inhibit stuttering both when produced and when perceived, and we suggest that these elementary stuttering forms may serve as compensatory speech gestures for releasing involuntary stuttering blocks by engaging mirror neuronal systems that are predisposed for fluent gestural imitation.
Full Text Available Apart from their remarkable phonological skills young infants prior to their first birthday show ability to match the mouth articulation they see with the speech sounds they hear. They are able to detect the audiovisual conflict of speech and to selectively attend to articulating mouth depending on audiovisual congruency. Early audiovisual speech processing is an important aspect of language development, related not only to phonological knowledge, but also to language production during subsequent years. Th is article reviews recent experimental work delineating the complex developmental trajectory of audiovisual mismatch detection. Th e central issue is the role of age-related changes in visual scanning of audiovisual speech and the corresponding changes in neural signatures of audiovisual speech processing in the second half of the first year of life. Th is phenomenon is discussed in the context of recent theories of perceptual development and existing data on the neural organisation of the infant ‘social brain’.
Banks, Briony; Gowen, Emma; Munro, Kevin J; Adank, Patti
Perceptual adaptation allows humans to recognize different varieties of accented speech. We investigated whether perceptual adaptation to accented speech is facilitated if listeners can see a speaker's facial and mouth movements. In Study 1, participants listened to sentences in a novel accent and underwent a period of training with audiovisual or audio-only speech cues, presented in quiet or in background noise. A control group also underwent training with visual-only (speech-reading) cues. We observed no significant difference in perceptual adaptation between any of the groups. To address a number of remaining questions, we carried out a second study using a different accent, speaker and experimental design, in which participants listened to sentences in a non-native (Japanese) accent with audiovisual or audio-only cues, without separate training. Participants' eye gaze was recorded to verify that they looked at the speaker's face during audiovisual trials. Recognition accuracy was significantly better for audiovisual than for audio-only stimuli; however, no statistical difference in perceptual adaptation was observed between the two modalities. Furthermore, Bayesian analysis suggested that the data supported the null hypothesis. Our results suggest that although the availability of visual speech cues may be immediately beneficial for recognition of unfamiliar accented speech in noise, it does not improve perceptual adaptation.
Roseboom, Warrick; Arnold, Derek H
Audiovisual timing perception can recalibrate following prolonged exposure to asynchronous auditory and visual inputs. It has been suggested that this might contribute to achieving perceptual synchrony for auditory and visual signals despite differences in physical and neural signal times for sight and sound. However, given that people can be concurrently exposed to multiple audiovisual stimuli with variable neural signal times, a mechanism that recalibrates all audiovisual timing percepts to a single timing relationship could be dysfunctional. In the experiments reported here, we showed that audiovisual temporal recalibration can be specific for particular audiovisual pairings. Participants were shown alternating movies of male and female actors containing positive and negative temporal asynchronies between the auditory and visual streams. We found that audiovisual synchrony estimates for each actor were shifted toward the preceding audiovisual timing relationship for that actor and that such temporal recalibrations occurred in positive and negative directions concurrently. Our results show that humans can form multiple concurrent estimates of appropriate timing for audiovisual synchrony.
Valkenier, Bea; Duyne, Jurriaan Y.; Andringa, Tjeerd C.; Baskent, Deniz
Purpose: Auditory perception of vowels in background noise is enhanced when combined with visually perceived speech features. The objective of this study was to investigate whether the influence of visual cues on vowel perception extends to incongruent vowels, in a manner similar to the McGurk effect observed with consonants. Method:…
Van Bree, K.C.
For many speech telecommunication technologies a robust speech activity detector is important. An audio-only speech detector will givefalse positives when the interfering signal is speech or has speech characteristics. The modality video is suitable to solve this problem. In this report the approach
Kronschnabel, Jens; Brem, Silvia; Maurer, Urs; Brandeis, Daniel
The classical phonological deficit account of dyslexia is increasingly linked to impairments in grapho-phonological conversion, and to dysfunctions in superior temporal regions associated with audiovisual integration. The present study investigates mechanisms of audiovisual integration in typical and impaired readers at the critical developmental stage of adolescence. Congruent and incongruent audiovisual as well as unimodal (visual only and auditory only) material was presented. Audiovisual presentations were single letters and three-letter (consonant-vowel-consonant) stimuli accompanied by matching or mismatching speech sounds. Three-letter stimuli exhibited fast phonetic transitions as in real-life language processing and reading. Congruency effects, i.e. different brain responses to congruent and incongruent stimuli were taken as an indicator of audiovisual integration at a phonetic level (grapho-phonological conversion). Comparisons of unimodal and audiovisual stimuli revealed basic, more sensory aspects of audiovisual integration. By means of these two criteria of audiovisual integration, the generalizability of audiovisual deficits in dyslexia was tested. Moreover, it was expected that the more naturalistic three-letter stimuli are superior to single letters in revealing group differences. Electrophysiological and hemodynamic (EEG and fMRI) data were acquired simultaneously in a simple target detection task. Applying the same statistical models to event-related EEG potentials and fMRI responses allowed comparing the effects detected by the two techniques at a descriptive level. Group differences in congruency effects (congruent against incongruent) were observed in regions involved in grapho-phonological processing, including the left inferior frontal and angular gyri and the inferotemporal cortex. Importantly, such differences also emerged in superior temporal key regions. Three-letter stimuli revealed stronger group differences than single letters. No
Full Text Available The present study compared elderly hearing aid (EHA users (n = 20 with elderly normal-hearing (ENH listeners (n = 20 in terms of isolation points (IPs, the shortest time required for correct identification of a speech stimulus and accuracy of audiovisual gated speech stimuli (consonants, words, and final words in highly and less predictable sentences presented in silence. In addition, we compared the IPs of audiovisual speech stimuli from the present study with auditory ones extracted from a previous study, to determine the impact of the addition of visual cues. Both participant groups achieved ceiling levels in terms of accuracy in the audiovisual identification of gated speech stimuli; however, the EHA group needed longer IPs for the audiovisual identification of consonants and words. The benefit of adding visual cues to auditory speech stimuli was more evident in the EHA group, as audiovisual presentation significantly shortened the IPs for consonants, words, and final words in less predictable sentences; in the ENH group, audiovisual presentation only shortened the IPs for consonants and words. In conclusion, although the audiovisual benefit was greater for EHA group, this group had inferior performance compared with the ENH group in terms of IPs when supportive semantic context was lacking. Consequently, EHA users needed the initial part of the audiovisual speech signal to be longer than did their counterparts with normal hearing to reach the same level of accuracy in the absence of a semantic context.
Moradi, Shahram; Lidestam, Björn; Rönnberg, Jerker
The present study compared elderly hearing aid (EHA) users (n = 20) with elderly normal-hearing (ENH) listeners (n = 20) in terms of isolation points (IPs, the shortest time required for correct identification of a speech stimulus) and accuracy of audiovisual gated speech stimuli (consonants, words, and final words in highly and less predictable sentences) presented in silence. In addition, we compared the IPs of audiovisual speech stimuli from the present study with auditory ones extracted from a previous study, to determine the impact of the addition of visual cues. Both participant groups achieved ceiling levels in terms of accuracy in the audiovisual identification of gated speech stimuli; however, the EHA group needed longer IPs for the audiovisual identification of consonants and words. The benefit of adding visual cues to auditory speech stimuli was more evident in the EHA group, as audiovisual presentation significantly shortened the IPs for consonants, words, and final words in less predictable sentences; in the ENH group, audiovisual presentation only shortened the IPs for consonants and words. In conclusion, although the audiovisual benefit was greater for EHA group, this group had inferior performance compared with the ENH group in terms of IPs when supportive semantic context was lacking. Consequently, EHA users needed the initial part of the audiovisual speech signal to be longer than did their counterparts with normal hearing to reach the same level of accuracy in the absence of a semantic context.
Bruderer, Alison G; Danielson, D Kyle; Kandhadai, Padmapriya; Werker, Janet F
The influence of speech production on speech perception is well established in adults. However, because adults have a long history of both perceiving and producing speech, the extent to which the perception-production linkage is due to experience is unknown. We addressed this issue by asking whether articulatory configurations can influence infants' speech perception performance. To eliminate influences from specific linguistic experience, we studied preverbal, 6-mo-old infants and tested the discrimination of a nonnative, and hence never-before-experienced, speech sound distinction. In three experimental studies, we used teething toys to control the position and movement of the tongue tip while the infants listened to the speech sounds. Using ultrasound imaging technology, we verified that the teething toys consistently and effectively constrained the movement and positioning of infants' tongues. With a looking-time procedure, we found that temporarily restraining infants' articulators impeded their discrimination of a nonnative consonant contrast but only when the relevant articulator was selectively restrained to prevent the movements associated with producing those sounds. Our results provide striking evidence that even before infants speak their first words and without specific listening experience, sensorimotor information from the articulators influences speech perception. These results transform theories of speech perception by suggesting that even at the initial stages of development, oral-motor movements influence speech sound discrimination. Moreover, an experimentally induced "impairment" in articulator movement can compromise speech perception performance, raising the question of whether long-term oral-motor impairments may impact perceptual development.
Tong Ming; Bian Zhengzhong; Li Xiaohui; Dai Qijun; Chen Yanpu
The perceptual effect of the phase information in speech has been studied by auditorysubjective tests. On the condition that the phase spectrum in speech is changed while amplitudespectrum is unchanged, the tests show that: (1) If the envelop of the reconstructed speech signalis unchanged, there is indistinctive auditory perception between the original speech and thereconstructed speech; (2) The auditory perception effect of the reconstructed speech mainly lieson the amplitude of the derivative of the additive phase; (3) td is the maximum relative time shiftbetween different frequency components of the reconstructed speech signal. The speech qualityis excellent while td ＜10ms; good while 10ms＜ td ＜20ms; common while 20ms＜ td ＜35ms, andpoor while td ＞35ms.
Full Text Available OBJECTIVE: To analyze speech reading through Internet video calls by profoundly hearing-impaired individuals and cochlear implant (CI users. METHODS: Speech reading skills of 14 deaf adults and 21 CI users were assessed using the Hochmair Schulz Moser (HSM sentence test. We presented video simulations using different video resolutions (1280 × 720, 640 × 480, 320 × 240, 160 × 120 px, frame rates (30, 20, 10, 7, 5 frames per second (fps, speech velocities (three different speakers, webcameras (Logitech Pro9000, C600 and C500 and image/sound delays (0-500 ms. All video simulations were presented with and without sound and in two screen sizes. Additionally, scores for live Skype™ video connection and live face-to-face communication were assessed. RESULTS: Higher frame rate (>7 fps, higher camera resolution (>640 × 480 px and shorter picture/sound delay (<100 ms were associated with increased speech perception scores. Scores were strongly dependent on the speaker but were not influenced by physical properties of the camera optics or the full screen mode. There is a significant median gain of +8.5%pts (p = 0.009 in speech perception for all 21 CI-users if visual cues are additionally shown. CI users with poor open set speech perception scores (n = 11 showed the greatest benefit under combined audio-visual presentation (median speech perception +11.8%pts, p = 0.032. CONCLUSION: Webcameras have the potential to improve telecommunication of hearing-impaired individuals.
Andersen, Tobias; Starrfelt, Randi
Lesions to Broca's area cause aphasia characterized by a severe impairment of the ability to speak, with comparatively intact speech perception. However, some studies have found effects on speech perception under adverse listening conditions, indicating that Broca's area is also involved in speec...
Using Moore's (1993) theory of transactional distance as a framework, this action research study explored students' perceptions of audiovisual feedback provided via screencasting as a supplement to text-only feedback. A crossover design was employed to ensure that all students experienced both text-only and text-plus-audiovisual feedback and to…
Full Text Available The sparse information captured by the sensory systems is used by the brain to apprehend the environment, for example, to spatially locate the source of audiovisual stimuli. This is an ill-posed inverse problem whose inherent uncertainty can be solved by jointly processing the information, as well as introducing constraints during this process, on the way this multisensory information is handled. This process and its result--the percept--depend on the contextual conditions perception takes place in. To date, perception has been investigated and modeled on the basis of either one of two of its dimensions: the percept or the temporal dynamics of the process. Here, we extend our previously proposed audiovisual perception model to predict both these dimensions to capture the phenomenon as a whole. Starting from a behavioral analysis, we use a data-driven approach to elicit a bayesian network which infers the different percepts and dynamics of the process. Context-specific independence analyses enable us to use the model's structure to directly explore how different contexts affect the way subjects handle the same available information. Hence, we establish that, while the percepts yielded by a unisensory stimulus or by the non-fusion of multisensory stimuli may be similar, they result from different processes, as shown by their differing temporal dynamics. Moreover, our model predicts the impact of bottom-up (stimulus driven factors as well as of top-down factors (induced by instruction manipulation on both the perception process and the percept itself.
Chen, Trevor H.
While the auditory-only aspects of Mandarin speech are heavily-researched and well-known in the field, this dissertation addresses its lesser-known aspects: The visual and audio-visual perception of Mandarin segmental information and lexical-tone information. Chapter II of this dissertation focuses on the audiovisual perception of Mandarin…
Full Text Available The paper deals with analytical review, covering the latest achievements in the field of audio-visual (AV fusion (integration of multimodal information. We discuss the main challenges and report on approaches to address them. One of the most important tasks of the AV integration is to understand how the modalities interact and influence each other. The paper addresses this problem in the context of AV speech processing and speech recognition. In the first part of the review we set out the basic principles of AV speech recognition and give the classification of audio and visual features of speech. Special attention is paid to the systematization of the existing techniques and the AV data fusion methods. In the second part we provide a consolidated list of tasks and applications that use the AV fusion based on carried out analysis of research area. We also indicate used methods, techniques, audio and video features. We propose classification of the AV integration, and discuss the advantages and disadvantages of different approaches. We draw conclusions and offer our assessment of the future in the field of AV fusion. In the further research we plan to implement a system of audio-visual Russian continuous speech recognition using advanced methods of multimodal fusion.
Wang, DeLiang; Kjems, Ulrik; Pedersen, Michael Syskind;
For a given mixture of speech and noise, an ideal binary time-frequency mask is constructed by comparing speech energy and noise energy within local time-frequency units. It is observed that listeners achieve nearly perfect speech recognition from gated noise with binary gains prescribed...... by the ideal binary mask. Only 16 filter channels and a frame rate of 100 Hz are sufficient for high intelligibility. The results show that, despite a dramatic reduction of speech information, a pattern of binary gains provides an adequate basis for speech perception....
Maidment, David W.; Kang, Hi Jee; Stewart, Hannah J.; Amitay, Sygal
Purpose: The study explored whether visual information improves speech identification in typically developing children with normal hearing when the auditory signal is spectrally degraded. Method: Children (n = 69) and adults (n = 15) were presented with noise-vocoded sentences from the Children's Co-ordinate Response Measure (Rosen, 2011) in…
Lynne E Bernstein
Full Text Available This paper examines the questions, what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody can be perceived visually, although individuals differ in their abilities to do so; and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically, face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread diverse patterns activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1 The visual perception of speech relies on visual pathway representations of speech qua speech. (2 A proposed site of these representations, the temporal visual speech area (TVSA has been demonstrated in posterior temporal cortex, ventral and posterior to multisensory posterior superior temporal sulcus (pSTS. (3 Given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA.
Full Text Available One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming realtively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process which implies rigidity of processingd with few demands on cognitive processing. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing by masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions that include descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine cognitive resources recruited during perception including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, this may provide new insights into ways in which hearing disorders and loss may be treated either through augementation or
Dubois, Cyril; Otzenberger, Helene; Gounot, Daniel; Sock, Rudolph; Metz-Lutz, Marie-Noelle
In a noisy environment, visual perception of articulatory movements improves natural speech intelligibility. Parallel to phonemic processing based on auditory signal, visemic processing constitutes a counterpart based on "visemes", the distinctive visual units of speech. Aiming at investigating the neural substrates of visemic processing in a…
Lusk, Laina G; Mitchel, Aaron D
Speech is inextricably multisensory: both auditory and visual components provide critical information for all aspects of speech processing, including speech segmentation, the visual components of which have been the target of a growing number of studies. In particular, a recent study (Mitchel and Weiss, 2014) established that adults can utilize facial cues (i.e., visual prosody) to identify word boundaries in fluent speech. The current study expanded upon these results, using an eye tracker to identify highly attended facial features of the audiovisual display used in Mitchel and Weiss (2014). Subjects spent the most time watching the eyes and mouth. A significant trend in gaze durations was found with the longest gaze duration on the mouth, followed by the eyes and then the nose. In addition, eye-gaze patterns changed across familiarization as subjects learned the word boundaries, showing decreased attention to the mouth in later blocks while attention on other facial features remained consistent. These findings highlight the importance of the visual component of speech processing and suggest that the mouth may play a critical role in visual speech segmentation.
A. A. Karpov
Full Text Available We present a conceptual model, architecture and software of a multimodal system for audio-visual speech and sign language synthesis by the input text. The main components of the developed multimodal synthesis system (signing avatar are: automatic text processor for input text analysis; simulation 3D model of human's head; computer text-to-speech synthesizer; a system for audio-visual speech synthesis; simulation 3D model of human’s hands and upper body; multimodal user interface integrating all the components for generation of audio, visual and signed speech. The proposed system performs automatic translation of input textual information into speech (audio information and gestures (video information, information fusion and its output in the form of multimedia information. A user can input any grammatically correct text in Russian or Czech languages to the system; it is analyzed by the text processor to detect sentences, words and characters. Then this textual information is converted into symbols of the sign language notation. We apply international «Hamburg Notation System» - HamNoSys, which describes the main differential features of each manual sign: hand shape, hand orientation, place and type of movement. On their basis the 3D signing avatar displays the elements of the sign language. The virtual 3D model of human’s head and upper body has been created using VRML virtual reality modeling language, and it is controlled by the software based on OpenGL graphical library. The developed multimodal synthesis system is a universal one since it is oriented for both regular users and disabled people (in particular, for the hard-of-hearing and visually impaired, and it serves for multimedia output (by audio and visual modalities of input textual information.
, in the light of the rich and complex Danish sound system. The first two studies report on native adults’ perception of Danish speech sounds in quiet and noise. The third study examined the development of language-specific perception in native Danish infants at 6, 9 and 12 months of age. The book points...... to interesting differences in speech perception and acquisition of Danish adults and infants when compared to English. The book is useful for professionals as well as students of linguistics, psycholinguistics and phonetics/phonology, or anyone else who may be interested in language.......Little is known about the perception of speech sounds by native Danish listeners. However, the Danish sound system differs in several interesting ways from the sound systems of other languages. For instance, Danish is characterized, among other features, by a rich vowel inventory and by different...
Research in the field of speech production pathology is dominated by describing deficits in output. However, perceptual problems might underlie, precede, or interact with production disorders. The present study hypothesizes that the level of the production disorders is linked to level of perception
Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
Full Text Available This paper proposes an audio-visual speech recognition method using lip information extracted from side-face images as an attempt to increase noise robustness in mobile environments. Our proposed method assumes that lip images can be captured using a small camera installed in a handset. Two different kinds of lip features, lip-contour geometric features and lip-motion velocity features, are used individually or jointly, in combination with audio features. Phoneme HMMs modeling the audio and visual features are built based on the multistream HMM technique. Experiments conducted using Japanese connected digit speech contaminated with white noise in various SNR conditions show effectiveness of the proposed method. Recognition accuracy is improved by using the visual information in all SNR conditions. These visual features were confirmed to be effective even when the audio HMM was adapted to noise by the MLLR method.
rqvcrw it necessary and idenbty by bloc* numewrl ,ELI I ROu I SUS. GR. l speech perception, prosody, context effects, phonetic 05 09 1 segments...found to aid listeners in correctly attributing the phonological source of vowel duration. The second series of experiments examines the role of... phonetic segments, and on the role of coarse-grained aspects of the speech signal in facilitating segment recognition. These extensions will address the
Holt, Lori L.
Despite a long and rich history of categorization research in cognitive psychology, very little work has addressed the issue of complex auditory category formation. This is especially unfortunate because the general underlying cognitive and perceptual mechanisms that guide auditory category formation are of great importance to understanding speech perception. I will discuss a new methodological approach to examining complex auditory category formation that specifically addresses issues relevant to speech perception. This approach utilizes novel nonspeech sound stimuli to gain full experimental control over listeners' history of experience. As such, the course of learning is readily measurable. Results from this methodology indicate that the structure and formation of auditory categories are a function of the statistical input distributions of sound that listeners hear, aspects of the operating characteristics of the auditory system, and characteristics of the perceptual categorization system. These results have important implications for phonetic acquisition and speech perception.
Lotto, Andrew J; Hickok, Gregory S; Holt, Lori L
The discovery of mirror neurons, a class of neurons that respond when a monkey performs an action and also when the monkey observes others producing the same action, has promoted a renaissance for the Motor Theory (MT) of speech perception. This is because mirror neurons seem to accomplish the same kind of one to one mapping between perception and action that MT theorizes to be the basis of human speech communication. However, this seeming correspondence is superficial, and there are theoretical and empirical reasons to temper enthusiasm about the explanatory role mirror neurons might have for speech perception. In fact, rather than providing support for MT, mirror neurons are actually inconsistent with the central tenets of MT.
Alan James Power
Full Text Available Auditory cortical oscillations have been proposed to play an important role in speech perception. It is suggested that the brain may take temporal ‘samples’ of information from the speech stream at different rates, phase-resetting ongoing oscillations so that they are aligned with similar frequency bands in the input (‘phase locking’. Information from these frequency bands is then bound together for speech perception. To date, there are no explorations of neural phase-locking and entrainment to speech input in children. However, it is clear from studies of language acquisition that infants use both visual speech information and auditory speech information in learning. In order to study neural entrainment to speech in typically-developing children, we use a rhythmic entrainment paradigm (underlying 2 Hz or delta rate based on repetition of the syllable ba, presented in either the auditory modality alone, the visual modality alone, or as auditory-visual speech (via a talking head. To ensure attention to the task, children aged 13 years were asked to press a button as fast as possible when the ba stimulus violated the rhythm for each stream type. Rhythmic violation depended on delaying the occurrence of a ba in the isochronous stream. Neural entrainment was demonstrated for all stream types, and individual differences in standardized measures of language processing were related to auditory entrainment at the theta rate. Further, there was significant modulation of the preferred phase of auditory entrainment in the theta band when visual speech cues were present, indicating cross-modal phase resetting. The rhythmic entrainment paradigm developed here offers a method for exploring individual differences in oscillatory phase locking during development. In particular, a method for assessing neural entrainment and cross-modal phase resetting would be useful for exploring developmental learning difficulties thought to involve temporal sampling
Li, Yuanqing; Long, Jinyi; Huang, Biao; Yu, Tianyou; Wu, Wei; Liu, Yongjian; Liang, Changhong; Sun, Pei
Previous studies have shown that audiovisual integration improves identification performance and enhances neural activity in heteromodal brain areas, for example, the posterior superior temporal sulcus/middle temporal gyrus (pSTS/MTG). Furthermore, it has also been demonstrated that attention plays an important role in crossmodal integration. In this study, we considered crossmodal integration in audiovisual facial perception and explored its effect on the neural representation of features. The audiovisual stimuli in the experiment consisted of facial movie clips that could be classified into 2 gender categories (male vs. female) or 2 emotion categories (crying vs. laughing). The visual/auditory-only stimuli were created from these movie clips by removing the auditory/visual contents. The subjects needed to make a judgment about the gender/emotion category for each movie clip in the audiovisual, visual-only, or auditory-only stimulus condition as functional magnetic resonance imaging (fMRI) signals were recorded. The neural representation of the gender/emotion feature was assessed using the decoding accuracy and the brain pattern-related reproducibility indices, obtained by a multivariate pattern analysis method from the fMRI data. In comparison to the visual-only and auditory-only stimulus conditions, we found that audiovisual integration enhanced the neural representation of task-relevant features and that feature-selective attention might play a role of modulation in the audiovisual integration.
Full Text Available Initially, infants are capable of discriminating phonetic contrasts across the world’s languages. Starting between seven and ten months of age, they gradually lose this ability through a process of perceptual narrowing. Although traditionally investigated with isolated speech sounds, such narrowing occurs in a variety of perceptual domains (e.g., faces, visual speech. Thus far, tracking the developmental trajectory of this tuning process has been focused primarily on auditory speech alone, and generally using isolated sounds. But infants learn from speech produced by people talking to them, meaning they learn from a complex audiovisual signal. Here, we use near-infrared spectroscopy to measure blood concentration changes in the bilateral temporal cortices of infants in three different age groups: 3-to-6 months, 7-to-10 months, and 11-to-14-months. Critically, all three groups of infants were tested with continuous audiovisual speech in both their native and another, unfamiliar language. We found that at each age range, infants showed different patterns of cortical activity in response to the native and non-native stimuli. Infants in the youngest group showed bilateral cortical activity that was greater overall in response to non-native relative to native speech; the oldest group showed left lateralized activity in response to native relative to non-native speech. These results highlight perceptual tuning as a dynamic process that happens across modalities and at different levels of stimulus complexity.
Elena V Kushnerenko
Full Text Available The use of visual cues during the processing of audiovisual speech is known to be less efficient in children and adults with language difficulties and difficulties are known to be more prevalent in children from low-income populations. In the present study, we followed an economically diverse group of thirty-seven infants longitudinally from 6-9 months to 14-16 months of age. We used eye-tracking to examine whether individual differences in visual attention during audiovisual processing of speech in 6 to 9 month old infants, particularly when processing congruent and incongruent auditory and visual speech cues, might be indicative of their later language development. Twenty-two of these 6-9 month old infants also participated in an event-related potential (ERP audiovisual task within the same experimental session. Language development was then followed-up at the age of 14-16 months, using two measures of language development, the Preschool Language Scale (PLS and the Oxford Communicative Development Inventory (CDI. The results show that those infants who were less efficient in auditory speech processing at the age of 6-9 months had lower receptive language scores at 14-16 months. A correlational analysis revealed that the pattern of face scanning and ERP responses to audio-visually incongruent stimuli at 6-9 months were both significantly associated with language development at 14-16 months. These findings add to the understanding of individual differences in neural signatures of audiovisual processing and associated looking behaviour in infants.
Campos-Sánchez, Antonio; López-Núñez, Juan-Antonio; Scionti, Giuseppe; Garzón, Ingrid; González-Andrades, Miguel; Alaminos, Miguel; Sola, Tomás
Videos can be used as didactic tools for self-learning under several circumstances, including those cases in which students are responsible for the development of this resource as an audiovisual notebook. We compared students' and teachers' perceptions regarding the main features that an audiovisual notebook should include. Four…
Gick, Bryan; Derrick, Donald
Visual information from a speaker’s face can enhance1 or interfere with2 accurate auditory perception. This integration of information across auditory and visual streams has been observed in functional imaging studies3,4, and has typically been attributed to the frequency and robustness with which perceivers jointly encounter event-specific information from these two modalities5. Adding the tactile modality has long been considered a crucial next step in understanding multisensory integration. However, previous studies have found an influence of tactile input on speech perception only under limited circumstances, either where perceivers were aware of the task6,7 or where they had received training to establish a cross-modal mapping8–10. Here we show that perceivers integrate naturalistic tactile information during auditory speech perception without previous training. Drawing on the observation that some speech sounds produce tiny bursts of aspiration (such as English ‘p’)11, we applied slight, inaudible air puffs on participants’ skin at one of two locations: the right hand or the neck. Syllables heard simultaneously with cutaneous air puffs were more likely to be heard as aspirated (for example, causing participants to mishear ‘b’ as ‘p’). These results demonstrate that perceivers integrate event-relevant tactile information in auditory perception in much the same way as they do visual information. PMID:19940925
This book presents a new approach to examining perceived quality of audiovisual sequences. It uses electroencephalography to understand how exactly user quality judgments are formed within a test participant, and what might be the physiologically-based implications when being exposed to lower quality media. The book redefines experimental paradigms of using EEG in the area of quality assessment so that they better suit the requirements of standard subjective quality testings. Therefore, experimental protocols and stimuli are adjusted accordingly. .
Chen, Yi-Chuan; Shore, David I; Lewis, Terri L; Maurer, Daphne
We measured the typical developmental trajectory of the window of audiovisual simultaneity by testing four age groups of children (5, 7, 9, and 11 years) and adults. We presented a visual flash and an auditory noise burst at various stimulus onset asynchronies (SOAs) and asked participants to report whether the two stimuli were presented at the same time. Compared with adults, children aged 5 and 7 years made more simultaneous responses when the SOAs were beyond ± 200 ms but made fewer simultaneous responses at the 0 ms SOA. The point of subjective simultaneity was located at the visual-leading side, as in adults, by 5 years of age, the youngest age tested. However, the window of audiovisual simultaneity became narrower and response errors decreased with age, reaching adult levels by 9 years of age. Experiment 2 ruled out the possibility that the adult-like performance of 9-year-old children was caused by the testing of a wide range of SOAs. Together, the results demonstrate that the adult-like precision of perceiving audiovisual simultaneity is developed by 9 years of age, the youngest age that has been reported to date.
de Kok, D.A.; Jonkers, R.; Bastiaanse, Y.R.M.
Individuals with aphasia have more problems detecting small differences between speech sounds than larger ones. This paper reports how phonemic processing is impaired and how this is influenced by speechreading. A non-word discrimination task was carried out with 'audiovisual', 'auditory only' and '
Munhall, K G; Servos, P; Santi, A; Goodale, M A
To examine the role of dynamic cues in visual speech perception, a patient with visual form agnosia (DF) was tested with a set of static and dynamic visual displays of three vowels. Five conditions were tested: (1) auditory only which provided only vocal pitch information, (2) dynamic visual only, (3) dynamic audiovisual with vocal pitch information, (4) dynamic audiovisual with full voice information and (5) static visual only images of postures during vowel production. DF showed normal performance in all conditions except the static visual only condition in which she scored at chance. Control subjects scored close to ceiling in this condition. The results suggest that spatiotemporal signatures for objects and events are processed separately from static form cues.
Song, Judy H; Skoe, Erika; Banai, Karen; Kraus, Nina
The presence of irrelevant auditory information (other talkers, environmental noises) presents a major challenge to listening to speech. The fundamental frequency (F(0)) of the target speaker is thought to provide an important cue for the extraction of the speaker's voice from background noise, but little is known about the relationship between speech-in-noise (SIN) perceptual ability and neural encoding of the F(0). Motivated by recent findings that music and language experience enhance brainstem representation of sound, we examined the hypothesis that brainstem encoding of the F(0) is diminished to a greater degree by background noise in people with poorer perceptual abilities in noise. To this end, we measured speech-evoked auditory brainstem responses to /da/ in quiet and two multitalker babble conditions (two-talker and six-talker) in native English-speaking young adults who ranged in their ability to perceive and recall SIN. Listeners who were poorer performers on a standardized SIN measure demonstrated greater susceptibility to the degradative effects of noise on the neural encoding of the F(0). Particularly diminished was their phase-locked activity to the fundamental frequency in the portion of the syllable known to be most vulnerable to perceptual disruption (i.e., the formant transition period). Our findings suggest that the subcortical representation of the F(0) in noise contributes to the perception of speech in noisy conditions.
Full Text Available Motion perception is a pervasive nature of vision and is affected by both immediate pattern of sensory inputs and prior experiences acquired through associations. Recently, several studies reported that an association can be established quickly between directions of visual motion and static sounds of distinct frequencies. After the association is formed, sounds are able to change the perceived direction of visual motion. To determine whether such rapidly acquired audiovisual associations and their subsequent influences on visual motion perception are dependent on the involvement of higher-order attentive tracking mechanisms, we designed psychophysical experiments using regular and reverse-phi random dot motions isolating low-level pre-attentive motion processing. Our results show that an association between the directions of low-level visual motion and static sounds can be formed and this audiovisual association alters the subsequent perception of low-level visual motion. These findings support the view that audiovisual associations are not restricted to high-level attention based motion system and early-level visual motion processing has some potential role.
Kafaligonul, Hulusi; Oluk, Can
Motion perception is a pervasive nature of vision and is affected by both immediate pattern of sensory inputs and prior experiences acquired through associations. Recently, several studies reported that an association can be established quickly between directions of visual motion and static sounds of distinct frequencies. After the association is formed, sounds are able to change the perceived direction of visual motion. To determine whether such rapidly acquired audiovisual associations and their subsequent influences on visual motion perception are dependent on the involvement of higher-order attentive tracking mechanisms, we designed psychophysical experiments using regular and reverse-phi random dot motions isolating low-level pre-attentive motion processing. Our results show that an association between the directions of low-level visual motion and static sounds can be formed and this audiovisual association alters the subsequent perception of low-level visual motion. These findings support the view that audiovisual associations are not restricted to high-level attention based motion system and early-level visual motion processing has some potential role.
Morrill, Ryan J.; Paukner, Annika; Ferrari, Pier F.; Ghazanfar, Asif A.
Across all languages studied to date, audiovisual speech exhibits a consistent rhythmic structure. This rhythm is critical to speech perception. Some have suggested that the speech rhythm evolved "de novo" in humans. An alternative account--the one we explored here--is that the rhythm of speech evolved through the modification of rhythmic facial…
Full Text Available One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those used in video-conference rooms, and thus, they may incur difficulties when constrained to the sensors with which a robot can be equipped. Besides, within the scope of interactive autonomous robots, there is a lack in terms of evaluating the benefits of audio-visual attention mechanisms, compared to only audio or visual approaches, in real scenarios. Most of the tests conducted have been within controlled environments, at short distances and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information with a Bayes inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. Moreover, the performance of this system is evaluated and compared via considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interactive framework.
Viciana-Abad, Raquel; Marfil, Rebeca; Perez-Lorenzo, Jose M; Bandera, Juan P; Romero-Garces, Adrian; Reche-Lopez, Pedro
One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those used in video-conference rooms, and thus, they may incur difficulties when constrained to the sensors with which a robot can be equipped. Besides, within the scope of interactive autonomous robots, there is a lack in terms of evaluating the benefits of audio-visual attention mechanisms, compared to only audio or visual approaches, in real scenarios. Most of the tests conducted have been within controlled environments, at short distances and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information with a Bayes inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. Moreover, the performance of this system is evaluated and compared via considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interactive framework.
Marchant, Jennifer L; Driver, Jon
Understanding how the brain extracts and combines temporal structure (rhythm) information from events presented to different senses remains unresolved. Many neuroimaging beat perception studies have focused on the auditory domain and show the presence of a highly regular beat (isochrony) in "auditory" stimulus streams enhances neural responses in a distributed brain network and affects perceptual performance. Here, we acquired functional magnetic resonance imaging (fMRI) measurements of brain activity while healthy human participants performed a visual task on isochronous versus randomly timed "visual" streams, with or without concurrent task-irrelevant sounds. We found that visual detection of higher intensity oddball targets was better for isochronous than randomly timed streams, extending previous auditory findings to vision. The impact of isochrony on visual target sensitivity correlated positively with fMRI signal changes not only in visual cortex but also in auditory sensory cortex during audiovisual presentations. Visual isochrony activated a similar timing-related brain network to that previously found primarily in auditory beat perception work. Finally, activity in multisensory left posterior superior temporal sulcus increased specifically during concurrent isochronous audiovisual presentations. These results indicate that regular isochronous timing can modulate visual processing and this can also involve multisensory audiovisual brain mechanisms.
Full Text Available The aim of this experiment was to investigate the influence of musical expertise on the automatic perception of foreign syllables and harmonic sounds. Participants were Cuban students with high level of expertise in music or in visual arts and with the same level of general education and socio-economic background. We used a multi-feature Mismatch Negativity (MMN design with sequences of either syllables in Mandarin Chinese or harmonic sounds, both comprising deviants in pitch contour, duration and Voice Onset Time (VOT or equivalent that were either far from (Large deviants or close to (Small deviants the standard. For both Mandarin syllables and harmonic sounds, results were clear-cut in showing larger MMNs to pitch contour deviants in musicians than in visual artists. Results were less clear for duration and VOT deviants, possibly because of the specific characteristics of the stimuli. Results are interpreted as reflecting similar processing of pitch contour in speech and non-speech sounds. The implications of these results for understanding the influence of intense musical training from childhood to adulthood and of genetic predispositions for music on foreign language perception is discussed.
Bertoncini, J; Cabrera, L
The development of speech perception relies upon early auditory capacities (i.e. discrimination, segmentation and representation). Infants are able to discriminate most of the phonetic contrasts occurring in natural languages, and at the end of the first year, this universal ability starts to narrow down to the contrasts used in the environmental language. During the second year, this specialization is characterized by the development of comprehension, lexical organization and word production. That process appears now as the result of multiple interactions between perceptual, cognitive and social developing abilities. Distinct factors like word acquisition, sensitivity to the statistical properties of the input, or even the nature of the social interactions, might play a role at one time or another during the acquisition of phonological patterns. Experience with the native language is necessary for phonetic segments to be functional units of perception and for speech sound representations (words, syllables) to be more specified and phonetically organized. This evolution goes on beyond 24 months of age in a learning context characterized from the early stages by the interaction with other developing (linguistic and non-linguistic) capacities.
Foundations of Voice and Speech Quality Perception starts out with the fundamental question of: "How do listeners perceive voice and speech quality and how can these processes be modeled?" Any quantitative answers require measurements. This is natural for physical quantities but harder to imagine for perceptual measurands. This book approaches the problem by actually identifying major perceptual dimensions of voice and speech quality perception, defining units wherever possible and offering paradigms to position these dimensions into a structural skeleton of perceptual speech and voice quality. The emphasis is placed on voice and speech quality assessment of systems in artificial scenarios. Many scientific fields are involved. This book bridges the gap between two quite diverse fields, engineering and humanities, and establishes the new research area of Voice and Speech Quality Perception.
Venezia, Jonathan H; Fillmore, Paul; Matchin, William; Isenberg, A Lisette; Hickok, Gregory; Fridriksson, Julius
Sensory information is critical for movement control, both for defining the targets of actions and providing feedback during planning or ongoing movements. This holds for speech motor control as well, where both auditory and somatosensory information have been shown to play a key role. Recent clinical research demonstrates that individuals with severe speech production deficits can show a dramatic improvement in fluency during online mimicking of an audiovisual speech signal suggesting the existence of a visuomotor pathway for speech motor control. Here we used fMRI in healthy individuals to identify this new visuomotor circuit for speech production. Participants were asked to perceive and covertly rehearse nonsense syllable sequences presented auditorily, visually, or audiovisually. The motor act of rehearsal, which is prima facie the same whether or not it is cued with a visible talker, produced different patterns of sensorimotor activation when cued by visual or audiovisual speech (relative to auditory speech). In particular, a network of brain regions including the left posterior middle temporal gyrus and several frontoparietal sensorimotor areas activated more strongly during rehearsal cued by a visible talker versus rehearsal cued by auditory speech alone. Some of these brain regions responded exclusively to rehearsal cued by visual or audiovisual speech. This result has significant implications for models of speech motor control, for the treatment of speech output disorders, and for models of the role of speech gesture imitation in development.
Kaganovich, Natalya; Schumaker, Jennifer
Sensitivity to the temporal relationship between auditory and visual stimuli is key to efficient audiovisual integration. However, even adults vary greatly in their ability to detect audiovisual temporal asynchrony. What underlies this variability is currently unknown. We recorded event-related potentials (ERPs) while participants performed a simultaneity judgment task on a range of audiovisual (AV) and visual-auditory (VA) stimulus onset asynchronies (SOAs) and compared ERP responses in good and poor performers to the 200ms SOA, which showed the largest individual variability in the number of synchronous perceptions. Analysis of ERPs to the VA200 stimulus yielded no significant results. However, those individuals who were more sensitive to the AV200 SOA had significantly more positive voltage between 210 and 270ms following the sound onset. In a follow-up analysis, we showed that the mean voltage within this window predicted approximately 36% of variability in sensitivity to AV temporal asynchrony in a larger group of participants. The relationship between the ERP measure in the 210-270ms window and accuracy on the simultaneity judgment task also held for two other AV SOAs with significant individual variability -100 and 300ms. Because the identified window was time-locked to the onset of sound in the AV stimulus, we conclude that sensitivity to AV temporal asynchrony is shaped to a large extent by the efficiency in the neural encoding of sound onsets.
Full Text Available Young infants are sensitive to multisensory temporal synchrony relations, but the neural dynamics of temporal interactions between vision and audition in infancy are not well understood. We investigated audiovisual synchrony and asynchrony perception in 6-month-old infants using event-related potentials (ERP. In a prior behavioral experiment (n = 45, infants were habituated to an audiovisual synchronous stimulus and tested for recovery of interest by presenting an asynchronous test stimulus in which the visual stream was delayed with respect to the auditory stream by 400 ms. Infants who behaviorally discriminated the change in temporal alignment were included in further analyses. In the EEG experiment (final sample: n = 15, synchronous and asynchronous stimuli (visual delay of 400 ms were presented in random order. Results show latency shifts in the auditory ERP components N1 and P2 as well as the infant ERP component Nc. Latencies in the asynchronous condition were significantly longer than in the synchronous condition. After video onset but preceding the auditory onset, amplitude modulations propagating from posterior to anterior sites and related to the Pb component of infants' ERP were observed. Results suggest temporal interactions between the two modalities. Specifically, they point to the significance of anticipatory visual motion for auditory processing, and indicate young infants’ predictive capacities for audiovisual temporal synchrony relations.
Full Text Available Congenital amusics, or tone-deaf individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying band-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions from the Macquarie Battery for Evaluation of Prosody (MBEP were reliably identifiable by US English speakers. The most reliably identified emotional speech samples were used in in Experiment 1, in which subjects performed a psychophysical pitch discrimination task, and an emotion identification task under band-pass and unfiltered speech conditions. Results showed a significant correlation between pitch discrimination threshold and emotion identification accuracy for band-pass filtered speech, with amusics (defined here as those with a pitch discrimination threshold > 16 Hz performing worse than controls. This relationship with pitch discrimination was not seen in unfiltered speech conditions. Given the dissociation between band-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in this manipulation.
Galvin, John J.; Fu, Qian-Jie
Combined use of a hearing aid (HA) and cochlear implant (CI) has been shown to improve CI users’ speech and music performance. However, different hearing devices, test stimuli, and listening tasks may interact and obscure bimodal benefits. In this study, speech and music perception were measured in bimodal listeners for CI-only, HA-only, and CI + HA conditions, using the Sung Speech Corpus, a database of monosyllabic words produced at different fundamental frequencies. Sentence recognition was measured using sung speech in which pitch was held constant or varied across words, as well as for spoken speech. Melodic contour identification (MCI) was measured using sung speech in which the words were held constant or varied across notes. Results showed that sentence recognition was poorer with sung speech relative to spoken, with little difference between sung speech with a constant or variable pitch; mean performance was better with CI-only relative to HA-only, and best with CI + HA. MCI performance was better with constant words versus variable words; mean performance was better with HA-only than with CI-only and was best with CI + HA. Relative to CI-only, a strong bimodal benefit was observed for speech and music perception. Relative to the better ear, bimodal benefits remained strong for sentence recognition but were marginal for MCI. While variations in pitch and timbre may negatively affect CI users’ speech and music perception, bimodal listening may partially compensate for these deficits. PMID:27837051
Conboy, Barbara T.; Sommerville, Jessica A.; Kuhl, Patricia K.
The development of speech perception during the 1st year reflects increasing attunement to native language features, but the mechanisms underlying this development are not completely understood. One previous study linked reductions in nonnative speech discrimination to performance on nonlinguistic tasks, whereas other studies have shown…
Biau, Emmanuel; Soto-Faraco, Salvador
Spontaneous beat gestures are an integral part of the paralinguistic context during face-to-face conversations. Here we investigated the time course of beat-speech integration in speech perception by measuring ERPs evoked by words pronounced with or without an accompanying beat gesture, while participants watched a spoken discourse. Words…
Full Text Available Audio-visual speech recognition (AVSR using acoustic and visual signals of speech have received attention recently because of its robustness in noisy environments. Perceptual studies also support this approach by emphasizing the importance of visual information for speech recognition in humans. An important issue in decision fusion based AVSR system is how to obtain the appropriate integration weight for the speech modalities to integrate and ensure the combined AVSR system’s performances better than that of the audio-only and visual-only systems under various noise conditions. To solve this issue, we present a genetic algorithm (GA based optimization scheme to obtain the appropriate integration weight from the relative reliability of each modality. The performance of the proposed GA optimized reliability-ratio based weight estimation scheme is demonstrated via single speaker, mobile functions isolated word recognition experiments. The results show that the proposed scheme improves robust recognition accuracy over the conventional unimodal systems and the baseline reliability ratio-based AVSR system under various signal to noise ratio conditions.
BU Fanliang; CHEN Yanpu
As the human ear is dull to the phase in speech, little attention has been paid tophase information in speech coding. In fact, the speech perceptual quality may be degeneratedif the phase distortion is very large. The perceptual effect of the STFT (Short time Fouriertransform) phase spectrum is studied by auditory subjective hearing tests. Three main con-clusions are (1) If the phase information is neglected completely, the subjective quality of thereconstructed speech may be very poor; (2) Whether the neglected phase is in low frequencyband or high frequency band, the difference from the original speech can be perceived by ear;(3) It is very difficult for the human ear to perceive the difference of speech quality betweenoriginal speech and reconstructed speech while the phase quantization step size is shorter thanπ/7.
José Antonio Palao Errando
Full Text Available En la llamada «sociedad de la información» los estudios sobre cine se han visto diluidos en el abordaje pragmático y tecnológico del discurso audiovisual, así como la propia fruición del cine se ha visto atrapada en la red del DVD y del hipertexto. El propio cine reacciona ante ello a través de estructuras narrativas complejas que lo alejan del discurso audiovisual estándar. La función de los estudios sobre cine y de su enseñanza universitaria debe ser la reintroducción del sujeto rechazado del saber informativo por medio de la interpretación del texto fílmico. In the so called «information society», film studies have been diluted in the pragmatic and technological approaching of the audiovisual speech, as well as the own fruition of the cinema has been caught in the net of DVD and hypertext. The cinema itself reacts in the face of it through complex narrative structures that take it away from the standard audio-visual speech. The function of film studies at the university education should be the reintroduction of the rejected subject of the informative knowledge by means of the interpretation of film text.
Full Text Available A growing body of evidence shows that brain oscillations track speech. This mechanism is thought to maximise processing efficiency by allocating resources to important speech information, effectively parsing speech into units of appropriate granularity for further decoding. However, some aspects of this mechanism remain unclear. First, while periodicity is an intrinsic property of this physiological mechanism, speech is only quasi-periodic, so it is not clear whether periodicity would present an advantage in processing. Second, it is still a matter of debate which aspect of speech triggers or maintains cortical entrainment, from bottom-up cues such as fluctuations of the amplitude envelope of speech to higher level linguistic cues such as syntactic structure. We present data from a behavioural experiment assessing the effect of isochronous retiming of speech on speech perception in noise. Two types of anchor points were defined for retiming speech, namely syllable onsets and amplitude envelope peaks. For each anchor point type, retiming was implemented at two hierarchical levels, a slow time scale around 2.5 Hz and a fast time scale around 4 Hz. Results show that while any temporal distortion resulted in reduced speech intelligibility, isochronous speech anchored to P-centers (approximated by stressed syllable vowel onsets was significantly more intelligible than a matched anisochronous retiming, suggesting a facilitative role of periodicity defined on linguistically motivated units in processing speech in noise.
This book interconnects two essential disciplines to study the perception of speech: Neuroscience and Quality of Experience, which to date have rarely been used together for the purposes of research on speech quality perception. In five key experiments, the book demonstrates the application of standard clinical methods in neurophysiology on the one hand, and of methods used in fields of research concerned with speech quality perception on the other. Using this combination, the book shows that speech stimuli with different lengths and different quality impairments are accompanied by physiological reactions related to quality variations, e.g., a positive peak in an event-related potential. Furthermore, it demonstrates that – in most cases – quality impairment intensity has an impact on the intensity of physiological reactions.
Recent literature in second language (L2) perceived fluency has focused on English as a second language, with a primary reliance on impressions from native-speaker judges, leaving learners' self-perceptions of speech production unexplored. This study investigates the relationship between learners' and judges' perceptions of French fluency under…
Hollich, George; Prince, Christopher G.
How much of infant behaviour can be accounted for by signal-level analyses of stimuli? The current paper directly compares the moment-by-moment behaviour of 8-month-old infants in an audiovisual preferential looking task with that of several computational models that use the same video stimuli as presented to the infants. One type of model…
Jiang, Jintao; Bernstein, Lynne E.
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses, auditory correct, visual correct, fusion (the so-called "McGurk effect"), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for…
Strömbergsson, Sofia; Wengelin, Asa; House, David
We explore children's perception of their own speech - in its online form, in its recorded form, and in synthetically modified forms. Children with phonological disorder (PD) and children with typical speech and language development (TD) performed tasks of evaluating accuracy of the different types of speech stimuli, either immediately after having produced the utterance or after a delay. In addition, they performed a task designed to assess their ability to detect synthetic modification. Both groups showed high performance in tasks involving evaluation of other children's speech, whereas in tasks of evaluating one's own speech, the children with PD were less accurate than their TD peers. The children with PD were less sensitive to misproductions in immediate conjunction with their production of an utterance, and more accurate after a delay. Within-category modification often passed undetected, indicating a satisfactory quality of the generated speech. Potential clinical benefits of using corrective re-synthesis are discussed.
Poeppel, David; Idsardi, William J; van Wassenhove, Virginie
Speech perception consists of a set of computations that take continuously varying acoustic waveforms as input and generate discrete representations that make contact with the lexical representations stored in long-term memory as output. Because the perceptual objects that are recognized by the speech perception enter into subsequent linguistic computation, the format that is used for lexical representation and processing fundamentally constrains the speech perceptual processes. Consequently, theories of speech perception must, at some level, be tightly linked to theories of lexical representation. Minimally, speech perception must yield representations that smoothly and rapidly interface with stored lexical items. Adopting the perspective of Marr, we argue and provide neurobiological and psychophysical evidence for the following research programme. First, at the implementational level, speech perception is a multi-time resolution process, with perceptual analyses occurring concurrently on at least two time scales (approx. 20-80 ms, approx. 150-300 ms), commensurate with (sub)segmental and syllabic analyses, respectively. Second, at the algorithmic level, we suggest that perception proceeds on the basis of internal forward models, or uses an 'analysis-by-synthesis' approach. Third, at the computational level (in the sense of Marr), the theory of lexical representation that we adopt is principally informed by phonological research and assumes that words are represented in the mental lexicon in terms of sequences of discrete segments composed of distinctive features. One important goal of the research programme is to develop linking hypotheses between putative neurobiological primitives (e.g. temporal primitives) and those primitives derived from linguistic inquiry, to arrive ultimately at a biologically sensible and theoretically satisfying model of representation and computation in speech.
Full Text Available This fMRI study examines shared and distinct cortical areas involved in the auditory perception of song and speech at the level of their underlying constituents: words, pitch and rhythm. Univariate and multivariate analyses were performed on the brain activity patterns of six conditions, arranged in a subtractive hierarchy: sung sentences including words, pitch and rhythm; hummed speech prosody and song melody containing only pitch patterns and rhythm; as well as the pure musical or speech rhythm.Systematic contrasts between these balanced conditions following their hierarchical organization showed a great overlap between song and speech at all levels in the bilateral temporal lobe, but suggested a differential role of the inferior frontal gyrus (IFG and intraparietal sulcus (IPS in processing song and speech. The left IFG was involved in word- and pitch-related processing in speech, the right IFG in processing pitch in song.Furthermore, the IPS showed sensitivity to discrete pitch relations in song as opposed to the gliding pitch in speech. Finally, the superior temporal gyrus and premotor cortex coded for general differences between words and pitch patterns, irrespective of whether they were sung or spoken. Thus, song and speech share many features which are reflected in a fundamental similarity of brain areas involved in their perception. However, fine-grained acoustic differences on word and pitch level are reflected in the activity of IFG and IPS.
Moradi, Shahram; Wahlin, Anna; Hällgren, Mathias; Rönnberg, Jerker; Lidestam, Björn
This study aimed to examine the efficacy and maintenance of short-term (one-session) gated audiovisual speech training for improving auditory sentence identification in noise in experienced elderly hearing-aid users. Twenty-five hearing aid users (16 men and 9 women), with an average age of 70.8 years, were randomly divided into an experimental (audiovisual training, n = 14) and a control (auditory training, n = 11) group. Participants underwent gated speech identification tasks comprising Swedish consonants and words presented at 65 dB sound pressure level with a 0 dB signal-to-noise ratio (steady-state broadband noise), in audiovisual or auditory-only training conditions. The Hearing-in-Noise Test was employed to measure participants’ auditory sentence identification in noise before the training (pre-test), promptly after training (post-test), and 1 month after training (one-month follow-up). The results showed that audiovisual training improved auditory sentence identification in noise promptly after the training (post-test vs. pre-test scores); furthermore, this improvement was maintained 1 month after the training (one-month follow-up vs. pre-test scores). Such improvement was not observed in the control group, neither promptly after the training nor at the one-month follow-up. However, no significant between-groups difference nor an interaction between groups and session was observed. Conclusion: Audiovisual training may be considered in aural rehabilitation of hearing aid users to improve listening capabilities in noisy conditions. However, the lack of a significant between-groups effect (audiovisual vs. auditory) or an interaction between group and session calls for further research. PMID:28348542
Anderson, Melinda C; Arehart, Kathryn H; Kates, James M
Speech perception depends on access to spectral and temporal acoustic cues. Temporal cues include slowly varying amplitude changes (i.e. temporal envelope, TE) and quickly varying amplitude changes associated with the center frequency of the auditory filter (i.e. temporal fine structure, TFS). This study quantifies the effects of TFS randomization through noise vocoding on the perception of speech quality by parametrically varying the amount of original TFS available above 1500Hz. The two research aims were: 1) to establish the role of TFS in quality perception, and 2) to determine if the role of TFS in quality perception differs between subjects with normal hearing and subjects with sensorineural hearing loss. Ratings were obtained from 20 subjects (10 with normal hearing and 10 with hearing loss) using an 11-point quality scale. Stimuli were processed in three different ways: 1) A 32-channel noise-excited vocoder with random envelope fluctuations in the noise carrier, 2) a 32-channel noise-excited vocoder with the noise-carrier envelope smoothed, and 3) removal of high-frequency bands. Stimuli were presented in quiet and in babble noise at 18dB and 12dB signal-to-noise ratios. TFS randomization had a measurable detrimental effect on quality ratings for speech in quiet and a smaller effect for speech in background babble. Subjects with normal hearing and subjects with sensorineural hearing loss provided similar quality ratings for noise-vocoded speech.
［1］Richard, P., Schumeyer, Kenneth E. B., The effect of visual information on word initial consonant perception of dysarthric speech, in Proc. ICSLP'96 October 3-6 1996, Philadephia, Pennsylvania, USA.［2］Goff, B. L., Marigny, T. G., Benoit, C., Read my lips...and my jaw! How intelligible are the components of a speaker's face? Eurospeech'95, 4th European Conference on Speech Communication and Technology, Madrid, September 1995.［3］McGurk, H., MacDonald, J. Hearing lips and seeing voices, Nature, 1976, 264: 746.［4］Duran A. F., Mcgurk effect in Spanish and German listeners: Influences of visual cues in the perception of Spanish and German confliction audio-visual stimuli, Eurospeech'95. 4th European Conference on Speech Communication and Technology, Madrid, September 1995.［5］Luettin, J., Visual speech and speaker recognition, Ph.D thesis, University of Sheffield, 1997.［6］Xu Yanjun, Du Limin, Chinese audiovisual bimodal speech database CAVSR1.0, Chinese Journal of Acoustics, to appear.［7］Zhang Jialu, Speech corpora and language input/output methods' evaluation, Chinese Applied Acoustics, 1994, 13(3): 5.
Burfin, Sabine; Pascalis, Olivier; Ruiz Tada, Elisa; Costa, Albert; Savariaux, Christophe; Kandel, Sonia
We all go through a process of perceptual narrowing for phoneme identification. As we become experts in the languages we hear in our environment we lose the ability to identify phonemes that do not exist in our native phonological inventory. This research examined how linguistic experience-i.e., the exposure to a double phonological code during childhood-affects the visual processes involved in non-native phoneme identification in audiovisual speech perception. We conducted a phoneme identification experiment with bilingual and monolingual adult participants. It was an ABX task involving a Bengali dental-retroflex contrast that does not exist in any of the participants' languages. The phonemes were presented in audiovisual (AV) and audio-only (A) conditions. The results revealed that in the audio-only condition monolinguals and bilinguals had difficulties in discriminating the retroflex non-native phoneme. They were phonologically "deaf" and assimilated it to the dental phoneme that exists in their native languages. In the audiovisual presentation instead, both groups could overcome the phonological deafness for the retroflex non-native phoneme and identify both Bengali phonemes. However, monolinguals were more accurate and responded quicker than bilinguals. This suggests that bilinguals do not use the same processes as monolinguals to decode visual speech.
Full Text Available We all go through a process of perceptual narrowing for phoneme identification. As we become experts in the languages we hear in our environment we lose the ability to identify phonemes that do not exist in our native phonological inventory. This research examined how linguistic experience –i.e., the exposure to a double phonological code during childhood– affects the visual processes involved in non-native phoneme identification in audiovisual speech perception. We conducted a phoneme identification experiment with bilingual and monolingual adult participants. It was an ABX task involving a Bengali dental-retroflex contrast that does not exist in any of the participants’ languages. The phonemes were presented in audiovisual (AV and audio-only (A conditions. The results revealed that in the audio-only condition monolinguals and bilinguals had difficulties in discriminating the retroflex non-native phoneme. They were phonologically deaf and assimilated it to the dental phoneme that exists in their native languages. In the audiovisual presentation instead, both groups could overcome the phonological deafness for the retroflex non-native phoneme and identify both Bengali phonemes. However, monolinguals were more accurate and responded quicker than bilinguals. This suggests that bilinguals do not use the same processes as monolinguals to decode visual speech.
Von Tiling, Johannes
This study examined listener perceptions of different ways of speaking often produced by people who stutter. Each of 115 independent listeners made quantitative and qualitative judgments upon watching one of four randomly assigned speech samples. Each of the four video clips showed the same everyday conversation between three young men, but…
Rance, Gary; Fava, Rosanne; Baldock, Heath; Chong, April; Barker, Elizabeth; Corben, Louise; Delatycki
The aim of this study was to investigate auditory pathway function and speech perception ability in individuals with Friedreich ataxia (FRDA). Ten subjects confirmed by genetic testing as being homozygous for a GAA expansion in intron 1 of the FXN gene were included. While each of the subjects demonstrated normal, or near normal sound detection, 3…
Phonetic variation has been considered a barrier that listeners must overcome in speech perception, but has been proved beneficial in category learning. In this paper, I show that listeners use within-speaker variation to accommodate gross categorical variation. Within the perceptual learning paradigm, listeners are exposed to p-initial words in…
Accounts of speech perception disagree on whether listeners perceive the acoustic signal (Diehl, Lotto, & Holt, 2004) or the vocal tract gestures that produce the signal (e.g., Fowler, 1986). In this dissertation, I outline a research program using a phenomenon called "perceptual compensation for coarticulation" (Mann, 1980) to examine this…
ZHOU Li; NIE Yong-Wei
The paper tries to analyze speech perception in terms of its structure, process, levels and models. Some problems con⁃cerning speech perception have been touched upon. The paper aims at providing some reference for oral English teaching and learning in the light of speech perception. It is intended to arouse readers’reflection upon the effect of speech perception on oral English teaching.
Full Text Available Audiovisual translation is now a well-established sub-discipline of Translation Studies (TS: a position that it has reached over the last twenty years or so. Italian scholars and professionals in the field have made a substantial contribution to this successful development, a brief overview of which will be given in the first part of this article, inevitably concentrating on dubbing in the Italian context. Special attention will be devoted to the question of target audience perception, an area where researchers in the University of Bologna at Forlì have excelled. The second part of the article applies the methodology followed by the above mentioned researchers in a case study of how Italian end users perceive the dubbed version of the British film The History Boys (2006, which contains a plethora of culture-specific verbal and visual references to the English education system. The aim of the study was to ascertain: a whether translation/adaptation allows the transmission in this admittedly constrained medium of all the intended culture-bound issues, only too well known to the source audience, and, if so, to what extent, and b whether the target audience respondents to the e-questionnaire used were aware that they were missing information. The linked, albeit controversial, issue of quality assessment will also be addressed.
Full Text Available The development of speech perception shows a dramatic transition between infancy and adulthood. Between 6 and 12 months, infants’ initial ability to discriminate all phonetic units across the worlds’ languages narrows—native discrimination increases while nonnative discrimination shows a steep decline. We used magnetoencephalography (MEG to examine whether brain oscillations in the theta band (4-8Hz, reflecting increases in attention and cognitive effort, would provide a neural measure of the perceptual narrowing phenomenon in speech. Using an oddball paradigm, we varied speech stimuli in two dimensions, stimulus frequency (frequent vs. infrequent and language (native vs. nonnative speech syllables and tested 6-month-old infants, 12-month-old infants, and adults. We hypothesized that 6-month-old infants would show increased relative theta power (RTP for frequent syllables, regardless of their status as native or nonnative syllables, reflecting young infants’ attention and cognitive effort in response to highly frequent stimuli (statistical learning. In adults, we hypothesized increased RTP for nonnative stimuli, regardless of their presentation frequency, reflecting increased cognitive effort for nonnative phonetic categories. The 12-month-old infants were expected to show a pattern in transition, but one more similar to adults than to 6-month-old infants. The MEG brain rhythm results supported these hypotheses. We suggest that perceptual narrowing in speech perception is governed by an implicit learning process. This learning process involves an implicit shift in attention from frequent events (infants to learned categories (adults. Theta brain oscillatory activity may provide an index of perceptual narrowing beyond speech, and would offer a test of whether the early speech learning process is governed by domain-general or domain-specific processes.
Schellenberg, E Glenn
Claims of beneficial side effects of music training are made for many different abilities, including verbal and visuospatial abilities, executive functions, working memory, IQ, and speech perception in particular. Such claims assume that music training causes the associations even though children who take music lessons are likely to differ from other children in music aptitude, which is associated with many aspects of speech perception. Music training in childhood is also associated with cognitive, personality, and demographic variables, and it is well established that IQ and personality are determined largely by genetics. Recent evidence also indicates that the role of genetics in music aptitude and music achievement is much larger than previously thought. In short, music training is an ideal model for the study of gene-environment interactions but far less appropriate as a model for the study of plasticity. Children seek out environments, including those with music lessons, that are consistent with their predispositions; such environments exaggerate preexisting individual differences.
Human locomotion typically creates noise, a possible consequence of which is the masking of sound signals originating in the surroundings. When walking side by side, people often subconsciously synchronize their steps. The neurophysiological and evolutionary background of this behavior is unclear. The present study investigated the potential of sound created by walking to mask perception of speech and compared the masking produced by walking in step with that produced by unsynchronized walkin...
Gick, Bryan; Derrick, Donald
Visual information from a speaker’s face can enhance1 or interfere with2 accurate auditory perception. This integration of information across auditory and visual streams has been observed in functional imaging studies3,4, and has typically been attributed to the frequency and robustness with which perceivers jointly encounter event-specific information from these two modalities5. Adding the tactile modality has long been considered a crucial next step in understanding multisensory integration...
Pisoni, D B
Over the past few years, there has been increased interest in studying some of the cognitive factors that affect speech perception performance of cochlear implant patients. In this paper, I provide a brief theoretical overview of the fundamental assumptions of the information-processing approach to cognition and discuss the role of perception, learning, and memory in speech perception and spoken language processing. The information-processing framework provides researchers and clinicians with a new way to understand the time-course of perceptual and cognitive development and the relations between perception and production of spoken language. Directions for future research using this approach are discussed including the study of individual differences, predicting success with a cochlear implant from a set of cognitive measures of performance and developing new intervention strategies.
Full Text Available Listeners vary in their ability to understand speech in noisy environments. Hearing sensitivity, as measured by pure-tone audiometry, can only partly explain these results, and cognition has emerged as another key concept. Although cognition relates to speech perception, the exact nature of the relationship remains to be fully understood. This study investigates how different aspects of cognition, particularly working memory and attention, relate to speech intelligibility for various tests.Perceptual accuracy of speech perception represents just one aspect of functioning in a listening environment. Activity and participation limits imposed by hearing loss, in addition to the demands of a listening environment, are also important and may be better captured by self-report questionnaires. Understanding how speech perception relates to self-reported aspects of listening forms the second focus of the study.Forty-four listeners aged between 50-74 years with mild SNHL were tested on speech perception tests differing in complexity from low (phoneme discrimination in quiet, to medium (digit triplet perception in speech-shaped noise to high (sentence perception in modulated noise; cognitive tests of attention, memory, and nonverbal IQ; and self-report questionnaires of general health-related and hearing-specific quality of life.Hearing sensitivity and cognition related to intelligibility differently depending on the speech test: neither was important for phoneme discrimination, hearing sensitivity alone was important for digit triplet perception, and hearing and cognition together played a role in sentence perception. Self-reported aspects of auditory functioning were correlated with speech intelligibility to different degrees, with digit triplets in noise showing the richest pattern. The results suggest that intelligibility tests can vary in their auditory and cognitive demands and their sensitivity to the challenges that auditory environments pose on
Santos, Karlos Thiago Pinheiro dos
Full Text Available Introduction: The most frequent complaint of the cochlear implant users has been to recognize and understand the speech signal in the presence of noise. Researches have been developed on the speech perception of users of cochlear implant with focus on aspects such as the effect of the reduction to the signal/noise ratio in the speech perception, the speech recognition in the noise, with different types of cochlear implant and strategies of speech codification and the effects of the binaural stimulation in the speech perception in noise. Objective: 1-To assess the speech perception in cochlear implant adult users in different positions regarding the presentation of the stimulus, 2-to compare the index of speech recognition in the frontal, ipsilateral and contralateral positions and 3-to analyze the effect of monoaural adaptation in the speech perception with noise. Method: 22 cochlear implant adult users were evaluated regarding the speech perception. The individuals were submitted to sentences recognition evaluation, with competitive noise in the signal/noise ratio +10 decibels in three positions: frontal, ipsilateral and contralateral to the cochlear implant side. Results: The results demonstrated the largest index of speech recognition in the ipsilateral position (100% and the lowest index of speech recognition with sentences in the contralateral position (5%. Conclusion: The performance of speech perception in cochlear implant users is damaged when the competitive noise is introduced, the index of speech recognition is better when the speech is presented ipsilaterally, and it's consequently worse when presented contralaterally to the cochlear implant, and there are more damages in the speech intelligibility when there is only monoaural input.
Full Text Available In many natural audiovisual events (e.g., a clap of the two hands, the visual signal precedes the sound and thus allows observers to predict when, where, and which sound will occur. Previous studies have already reported that there are distinct neural correlates of temporal (when versus phonetic/semantic (which content on audiovisual integration. Here we examined the effect of visual prediction of auditory location (where in audiovisual biological motion stimuli by varying the spatial congruency between the auditory and visual part of the audiovisual stimulus. Visual stimuli were presented centrally, whereas auditory stimuli were presented either centrally or at 90° azimuth. Typical subadditive amplitude reductions (AV – V < A were found for the auditory N1 and P2 for spatially congruent and incongruent conditions. The new finding is that the N1 suppression was larger for spatially congruent stimuli. A very early audiovisual interaction was also found at 30-50 ms in the spatially congruent condition, while no effect of congruency was found on the suppression of the P2. This indicates that visual prediction of auditory location can be coded very early in auditory processing.
Kishon-Rabin, L; Rosenhouse, J
The high incidence of hearing impairment in the Arabic-speaking population in Israel, as well as the use of advanced aural rehabilitation devices, motivated the development of Arabic speech assessment tests for this population. The purpose of this paper is twofold. The first goal is to describe features that are unique to the Arabic language and that need to be considered when developing such speech tests. These include Arabic diglossia (i.e., the sharp dichotomy between Literary and Colloquial Arabic), emphatization, and a simple vowel system. The second goal is to describe a new analytic speech test that assesses the perception of significant phonological contrasts in the Colloquial Arabic variety used in Israel. The perception of voicing, place, and manner of articulation, in both initial and final word positions, was tested at four sensation levels in 10 normally-hearing subjects using a binary forced-choice paradigm. Results show a relationship between percent correct and presentation level that is in keeping with articulation curves obtained with Saudi Arabic and English monosyllabic words. Furthermore, different contrasts yielded different articulation curves: emphatization was the easiest to perceive whereas place of articulation was the most difficult. The results can be explained by the specific acoustical features of Arabic.
Cason, Nia; Astésano, Corine; Schön, Daniele
Following findings that musical rhythmic priming enhances subsequent speech perception, we investigated whether rhythmic priming for spoken sentences can enhance phonological processing - the building blocks of speech - and whether audio-motor training enhances this effect. Participants heard a metrical prime followed by a sentence (with a matching/mismatching prosodic structure), for which they performed a phoneme detection task. Behavioural (RT) data was collected from two groups: one who received audio-motor training, and one who did not. We hypothesised that 1) phonological processing would be enhanced in matching conditions, and 2) audio-motor training with the musical rhythms would enhance this effect. Indeed, providing a matching rhythmic prime context resulted in faster phoneme detection, thus revealing a cross-domain effect of musical rhythm on phonological processing. In addition, our results indicate that rhythmic audio-motor training enhances this priming effect. These results have important implications for rhythm-based speech therapies, and suggest that metrical rhythm in music and speech may rely on shared temporal processing brain resources.
Bidelman, Gavin M; Moreno, Sylvain; Alain, Claude
Speech perception requires the effortless mapping from smooth, seemingly continuous changes in sound features into discrete perceptual units, a conversion exemplified in the phenomenon of categorical perception. Explaining how/when the human brain performs this acoustic-phonetic transformation remains an elusive problem in current models and theories of speech perception. In previous attempts to decipher the neural basis of speech perception, it is often unclear whether the alleged brain correlates reflect an underlying percept or merely changes in neural activity that covary with parameters of the stimulus. Here, we recorded neuroelectric activity generated at both cortical and subcortical levels of the auditory pathway elicited by a speech vowel continuum whose percept varied categorically from /u/ to /a/. This integrative approach allows us to characterize how various auditory structures code, transform, and ultimately render the perception of speech material as well as dissociate brain responses reflecting changes in stimulus acoustics from those that index true internalized percepts. We find that activity from the brainstem mirrors properties of the speech waveform with remarkable fidelity, reflecting progressive changes in speech acoustics but not the discrete phonetic classes reported behaviorally. In comparison, patterns of late cortical evoked activity contain information reflecting distinct perceptual categories and predict the abstract phonetic speech boundaries heard by listeners. Our findings demonstrate a critical transformation in neural speech representations between brainstem and early auditory cortex analogous to an acoustic-phonetic mapping necessary to generate categorical speech percepts. Analytic modeling demonstrates that a simple nonlinearity accounts for the transformation between early (subcortical) brain activity and subsequent cortical/behavioral responses to speech (>150-200 ms) thereby describing a plausible mechanism by which the
Gerrits, Ellen; de Bree, Elise
Speech perception and speech production were examined in 3-year-old Dutch children at familial risk of developing dyslexia. Their performance in speech sound categorisation and their production of words was compared to that of age-matched children with specific language impairment (SLI) and typically developing controls. We found that speech…
Hickok, Gregory; Costanzo, Maddalena; Capasso, Rita; Miceli, Gabriele
Motor theories of speech perception have been re-vitalized as a consequence of the discovery of mirror neurons. Some authors have even promoted a strong version of the motor theory, arguing that the motor speech system is critical for perception. Part of the evidence that is cited in favor of this claim is the observation from the early 1980s that…
Ziegler, Johannes C.; Pech-Georgel, Catherine; George, Florence; Lorenzi, Christian
Speech perception of four phonetic categories (voicing, place, manner, and nasality) was investigated in children with specific language impairment (SLI) (n=20) and age-matched controls (n=19) in quiet and various noise conditions using an AXB two-alternative forced-choice paradigm. Children with SLI exhibited robust speech perception deficits in…
Mealings, Kiri T.; Demuth, Katherine; Buchholz, Jörg; Dillon, Harvey
Purpose: Open-plan classroom styles are increasingly being adopted in Australia despite evidence that their high intrusive noise levels adversely affect learning. The aim of this study was to develop a new Australian speech perception task (the Mealings, Demuth, Dillon, and Buchholz Classroom Speech Perception Test) and use it in an open-plan…
Burgaleta, Miguel; Baus, Cristina; Díaz, Begoña; Sebastián-Gallés, Núria
Morphology of the human brain predicts the speed at which individuals learn to distinguish novel foreign speech sounds after laboratory training. However, little is known about the neuroanatomical basis of individual differences in speech perception when a second language (L2) has been learned in natural environments for extended periods of time. In the present study, two samples of highly proficient bilinguals were selected according to their ability to distinguish between very similar L2 sounds, either isolated (prelexical) or within words (lexical). Structural MRI was acquired and processed to estimate vertex-wise indices of cortical thickness (CT) and surface area (CSA), and the association between cortical morphology and behavioral performance was inspected. Results revealed that performance in the lexical task was negatively associated with the thickness of the left temporal cortex and angular gyrus, as well as with the surface area of the left precuneus. Our findings, consistently with previous fMRI studies, demonstrate that morphology of the reported areas is relevant for word recognition based on phonological information. Further, we discuss the possibility that increased CT and CSA in sound-to-meaning mapping regions, found for poor non-native speech sounds perceivers, would have plastically arisen after extended periods of increased functional activity during L2 exposure.
Hakvoort, Britt; de Bree, Elise; van der Leij, Aryan; Maassen, Ben; van Setten, Ellie; Maurits, Natasha; van Zuijen, Titia L.
Purpose: This study assessed whether a categorical speech perception (CP) deficit is associated with dyslexia or familial risk for dyslexia, by exploring a possible cascading relation from speech perception to phonology to reading and by identifying whether speech perception distinguishes familial risk (FR) children with dyslexia (FRD) from those…
Möttönen, Riikka; Watkins, Kate E
Background: The ability to communicate using speech is a remarkable skill, which requires precise coordination of articulatory movements and decoding of complex acoustic signals. According to the traditional view, speech production and perception rely on motor and auditory brain areas, respectively. However, there is growing evidence that auditory-motor circuits support both speech production and perception.Aims: In this article we provide a review of how transcranial magnetic stimulation (TMS) has been used to investigate the excitability of the motor system during listening to speech and the contribution of the motor system to performance in various speech perception tasks. We also discuss how TMS can be used in combination with brain-imaging techniques to study interactions between motor and auditory systems during speech perception.Main contribution: TMS has proven to be a powerful tool to investigate the role of the articulatory motor system in speech perception.Conclusions: TMS studies have provided support for the view that the motor structures that control the movements of the articulators contribute not only to speech production but also to speech perception.
Full Text Available Human locomotion typically creates noise, a possible consequence of which is the masking of sound signals originating in the surroundings. When walking side by side, people often subconsciously synchronize their steps. The neurophysiological and evolutionary background of this behavior is unclear. The present study investigated the potential of sound created by walking to mask perception of speech and compared the masking produced by walking in step with that produced by unsynchronized walking. The masking sound (footsteps on gravel and the target sound (speech were presented through the same speaker to 15 normal-hearing subjects. The original recorded walking sound was modified to mimic the sound of two individuals walking in pace or walking out of synchrony. The participants were instructed to adjust the sound level of the target sound until they could just comprehend the speech signal ("just follow conversation" or JFC level when presented simultaneously with synchronized or unsynchronized walking sound at 40 dBA, 50 dBA, 60 dBA, or 70 dBA. Synchronized walking sounds produced slightly less masking of speech than did unsynchronized sound. The median JFC threshold in the synchronized condition was 38.5 dBA, while the corresponding value for the unsynchronized condition was 41.2 dBA. Combined results at all sound pressure levels showed an improvement in the signal-to-noise ratio (SNR for synchronized footsteps; the median difference was 2.7 dB and the mean difference was 1.2 dB [P < 0.001, repeated-measures analysis of variance (RM-ANOVA]. The difference was significant for masker levels of 50 dBA and 60 dBA, but not for 40 dBA or 70 dBA. This study provides evidence that synchronized walking may reduce the masking potential of footsteps.
Larsson, Matz; Ekström, Seth Reino; Ranjbar, Parivash
Human locomotion typically creates noise, a possible consequence of which is the masking of sound signals originating in the surroundings. When walking side by side, people often subconsciously synchronize their steps. The neurophysiological and evolutionary background of this behavior is unclear. The present study investigated the potential of sound created by walking to mask perception of speech and compared the masking produced by walking in step with that produced by unsynchronized walking. The masking sound (footsteps on gravel) and the target sound (speech) were presented through the same speaker to 15 normal-hearing subjects. The original recorded walking sound was modified to mimic the sound of two individuals walking in pace or walking out of synchrony. The participants were instructed to adjust the sound level of the target sound until they could just comprehend the speech signal ("just follow conversation" or JFC level) when presented simultaneously with synchronized or unsynchronized walking sound at 40 dBA, 50 dBA, 60 dBA, or 70 dBA. Synchronized walking sounds produced slightly less masking of speech than did unsynchronized sound. The median JFC threshold in the synchronized condition was 38.5 dBA, while the corresponding value for the unsynchronized condition was 41.2 dBA. Combined results at all sound pressure levels showed an improvement in the signal-to-noise ratio (SNR) for synchronized footsteps; the median difference was 2.7 dB and the mean difference was 1.2 dB [P < 0.001, repeated-measures analysis of variance (RM-ANOVA)]. The difference was significant for masker levels of 50 dBA and 60 dBA, but not for 40 dBA or 70 dBA. This study provides evidence that synchronized walking may reduce the masking potential of footsteps.
The perception of human languages is inherently a multi-modalprocess, in which audio information can be compensated by visual information to improve the recognition performance. Such a phenomenon in English, German, Spanish and so on has been researched, but in Chinese it has not been reported yet. In our experiment, 14 syllables (/ba, bi, bian, biao, bin, de, di, dian, duo, dong, gai, gan, gen, gu/), extracted from Chinese audiovisual bimodal speech database CAVSR-1.0, were pronounced by 10 subjects. The audio-only stimuli, audiovisual stimuli, and visual-only stimuli were recognized by 20 observers. The audio-only stimuli and audiovisual stimuli both were presented under 5 conditions: no noise, SNR 0 dB, -8 dB, -12 dB, and -16 dB. The experimental result is studied and the following conclusions for Chinese speech are reached. Human beings can recognize visual-only stimuli rather well. The place of articulation determines the visual distinction. In noisy environment, audio information can remarkably be compensated by visual information and as a result the recognition performance is greatly improved.
One theory (weak central coherence) that accounts for a different perceptual-cognitive style in autism may suggest the possibility that individuals with autism are less likely to be affected by lexical knowledge on speech perception. This lexical context effects on speech perception has been evidenced by Ganong (1980) by using word-to-nonword identification test along a VOT dimension. This Ganong effect (which suggests that people tend to make their percept a real word) can be seen as one ...
Hemphill, Rachel Marie
In a series of five experiments, the author tests the hypothesis that speech processing in the human mind demands two separate phonological representations: one for perception and one for production (Menn 1980, 1983; Straight 1980; Menn & Matthei 1992). The experiments probe the structure and of these mental categories and how they change in the process of acquisition. Three groups of native English-speaking subjects were taught to categorically perceive a three way Thai voicing contrast in synthetic bilabial stop consonants, which varied only in VOT (after Pisoni, Aslin, Perey, and Hennessy 1982). Perception and production tests were administered following training. Subjects showed the ability, which improved with training, to categorically identify the three-way voicing contrast. Subsequent acoustic and perceptual analyses showed that they were unable to produce the contrast correctly, producing no difference, or manipulating acoustic variables other than VOT (vowel duration, vowel quality, nasalization, etc.). When subjects' productions were compared to their pronunciations of English labial stops, it was found that subjects construct a new production category for the Thai prevoiced stop category. In contrast, subjects split their existing English perceptual /b/ category, indicating that perceptual and production phonological categories do not change in parallel. In a subsequent experiment, subjects were re-tested on perception of the synthetic stimuli, productions of two native Thai speakers, and on their own productions from the previous experiments. An analysis of the perceptual data shows that subjects performed equally well on the four tasks, indicating that they are no better at identifying their own productions than those of novel talkers or synthetic talkers. This finding contradicts the hypothetical direct link between perception and production phonologies. These results are explained in terms of separate expressive and receptive representations and the
Full Text Available Developmental stuttering is a speech disorder that disrupts the ability to produce speech fluently. While stuttering is typically diagnosed based on one's behavior during speech production, some models suggest that it involves more central representations of language, and thus may affect language perception as well. Here we tested the hypothesis that developmental stuttering implicates neural systems involved in language perception, in a task that manipulates comprehensibility without an overt speech production component. We used functional magnetic resonance imaging to measure blood oxygenation level dependent (BOLD signals in adults who do and do not stutter, while they were engaged in an incidental speech perception task. We found that speech perception evokes stronger activation in adults who stutter (AWS compared to controls, specifically in the right inferior frontal gyrus (RIFG and in left Heschl's gyrus (LHG. Significant differences were additionally found in the lateralization of response in the inferior frontal cortex: AWS showed bilateral inferior frontal activity, while controls showed a left lateralized pattern of activation. These findings suggest that developmental stuttering is associated with an imbalanced neural network for speech processing, which is not limited to speech production, but also affects cortical responses during speech perception.
Yamamoto, Kosuke; Kawabata, Hideaki
We ordinarily speak fluently, even though our perceptions of our own voices are disrupted by various environmental acoustic properties. The underlying mechanism of speech is supposed to monitor the temporal relationship between speech production and the perception of auditory feedback, as suggested by a reduction in speech fluency when the speaker is exposed to delayed auditory feedback (DAF). While many studies have reported that DAF influences speech motor processing, its relationship to the temporal tuning effect on multimodal integration, or temporal recalibration, remains unclear. We investigated whether the temporal aspects of both speech perception and production change due to adaptation to the delay between the motor sensation and the auditory feedback. This is a well-used method of inducing temporal recalibration. Participants continually read texts with specific DAF times in order to adapt to the delay. Then, they judged the simultaneity between the motor sensation and the vocal feedback. We measured the rates of speech with which participants read the texts in both the exposure and re-exposure phases. We found that exposure to DAF changed both the rate of speech and the simultaneity judgment, that is, participants' speech gained fluency. Although we also found that a delay of 200 ms appeared to be most effective in decreasing the rates of speech and shifting the distribution on the simultaneity judgment, there was no correlation between these measurements. These findings suggest that both speech motor production and multimodal perception are adaptive to temporal lag but are processed in distinct ways.
Reed, C M; Durlach, N I; Braida, L D; Schultz, M C
In the Tadoma method of communication, deaf-blind individuals receive speech by placing a hand on the face and neck of the talker and monitoring actions associated with speech production. Previous research has documented the speech perception, speech production, and linguistic abilities of highly experienced users of the Tadoma method. The current study was performed to gain further insight into the cues involved in the perception of speech segments through Tadoma. Small-set segmental identification experiments were conducted in which the subjects' access to various types of articulatory information was systematically varied by imposing limitations on the contact of the hand with the face. Results obtained on 3 deaf-blind, highly experienced users of Tadoma were examined in terms of percent-correct scores, information transfer, and reception of speech features for each of sixteen experimental conditions. The results were generally consistent with expectations based on the speech cues assumed to be available in the various hand positions.
Smayda, Kirsten E; Van Engen, Kristin J; Maddox, W Todd; Chandrasekaran, Bharath
Speech perception is critical to everyday life. Oftentimes noise can degrade a speech signal; however, because of the cues available to the listener, such as visual and semantic cues, noise rarely prevents conversations from continuing. The interaction of visual and semantic cues in aiding speech perception has been studied in young adults, but the extent to which these two cues interact for older adults has not been studied. To investigate the effect of visual and semantic cues on speech perception in older and younger adults, we recruited forty-five young adults (ages 18-35) and thirty-three older adults (ages 60-90) to participate in a speech perception task. Participants were presented with semantically meaningful and anomalous sentences in audio-only and audio-visual conditions. We hypothesized that young adults would outperform older adults across SNRs, modalities, and semantic contexts. In addition, we hypothesized that both young and older adults would receive a greater benefit from a semantically meaningful context in the audio-visual relative to audio-only modality. We predicted that young adults would receive greater visual benefit in semantically meaningful contexts relative to anomalous contexts. However, we predicted that older adults could receive a greater visual benefit in either semantically meaningful or anomalous contexts. Results suggested that in the most supportive context, that is, semantically meaningful sentences presented in the audiovisual modality, older adults performed similarly to young adults. In addition, both groups received the same amount of visual and meaningful benefit. Lastly, across groups, a semantically meaningful context provided more benefit in the audio-visual modality relative to the audio-only modality, and the presence of visual cues provided more benefit in semantically meaningful contexts relative to anomalous contexts. These results suggest that older adults can perceive speech as well as younger adults when both
David L Woods
Full Text Available The most common complaint of older hearing impaired (OHI listeners is difficulty understanding speech in the presence of noise. However, tests of consonant-identification and sentence reception threshold (SeRT provide different perspectives on the magnitude of impairment. Here we quantified speech perception difficulties in 24 OHI listeners in unaided and aided conditions by analyzing (1 consonant-identification thresholds and consonant confusions for 20 onset and 20 coda consonants in consonant-vowel-consonant (CVC syllables presented at consonant-specific signal-to-noise (SNR levels, and (2 SeRTs obtained with the Quick Speech in Noise Test (QSIN and the Hearing in Noise Test (HINT. Compared to older normal hearing (ONH listeners, nearly all unaided OHI listeners showed abnormal consonant-identification thresholds, abnormal consonant confusions, and reduced psychometric function slopes. Average elevations in consonant-identification thresholds exceeded 35 dB, correlated strongly with impairments in mid-frequency hearing, and were greater for hard-to-identify consonants. Advanced digital hearing aids (HAs improved average consonant-identification thresholds by more than 17 dB, with significant HA benefit seen in 83% of OHI listeners. HAs partially normalized consonant-identification thresholds, reduced abnormal consonant confusions, and increased the slope of psychometric functions. Unaided OHI listeners showed much smaller elevations in SeRTs (mean 6.9 dB than in consonant-identification thresholds and SeRTs in unaided listening conditions correlated strongly (r = 0.91 with identification thresholds of easily identified consonants. HAs produced minimal SeRT benefit (2.0 dB, with only 38% of OHI listeners showing significant improvement. HA benefit on SeRTs was accurately predicted (r = 0.86 by HA benefit on easily identified consonants. Consonant-identification tests can accurately predict sentence processing deficits and HA benefit in OHI
Full Text Available This study investigated whether auditory, speech perception and phonological skills are tightly interrelated or independently contributing to reading. We assessed each of these three skills in 36 adults with a past diagnosis of dyslexia and 54 matched normal reading adults. Phonological skills were tested by the typical threefold tasks, i.e. rapid automatic naming, verbal short term memory and phonological awareness. Dynamic auditory processing skills were assessed by means of a frequency modulation (FM and an amplitude rise time (RT; an intensity discrimination task (ID was included as a non-dynamic control task. Speech perception was assessed by means of sentences and words in noise tasks. Group analysis revealed significant group differences in auditory tasks (i.e. RT and ID and in phonological processing measures, yet no differences were found for speech perception. In addition, performance on RT discrimination correlated with reading but this relation was mediated by phonological processing and not by speech in noise. Finally, inspection of the individual scores revealed that the dyslexic readers showed an increased proportion of deviant subjects on the slow-dynamic auditory and phonological tasks, yet each individual dyslexic reader does not display a clear pattern of deficiencies across the levels of processing skills. Although our results support phonological and slow-rate dynamic auditory deficits which relate to literacy, they suggest that at the individual level, problems in reading and writing cannot be explained by the cascading auditory theory. Instead, dyslexic adults seem to vary considerably in the extent to which each of the auditory and phonological factors are expressed and interact with environmental and higher-order cognitive influences.
Dole, Marjorie; Meunier, Fanny; Hoen, Michel
Dyslexia is a language-based neurodevelopmental disorder. It is characterized as a persistent deficit in reading and spelling. These difficulties have been shown to result from an underlying impairment of the phonological component of language, possibly also affecting speech perception. Although there is little evidence for such a deficit under optimal, quiet listening conditions, speech perception difficulties in adults with dyslexia are often reported under more challenging conditions, such as when speech is masked by noise. Previous studies have shown that these difficulties are more pronounced when the background noise is speech and when little spatial information is available to facilitate differentiation between target and background sound sources. In this study, we investigated the neuroimaging correlates of speech-in-speech perception in typical readers and participants with dyslexia, focusing on the effects of different listening configurations. Fourteen adults with dyslexia and 14 matched typical readers performed a subjective intelligibility rating test with single words presented against concurrent speech during functional magnetic resonance imaging (fMRI) scanning. Target words were always presented with a four-talker background in one of three listening configurations: Dichotic, Binaural or Monaural. The results showed that in the Monaural configuration, in which no spatial information was available and energetic masking was maximal, intelligibility was severely decreased in all participants, and this effect was particularly strong in participants with dyslexia. Functional imaging revealed that in this configuration, participants partially compensate for their poorer listening abilities by recruiting several areas in the cerebral networks engaged in speech perception. In the Binaural configuration, participants with dyslexia achieved the same performance level as typical readers, suggesting that they were able to use spatial information when available
This study supports claims of a relationship between speech perception and phonology with evidence from a crosslinguistic perception experiment involving /h/ deletion in Turkish. Turkish /h/ is often deleted in fast speech, but only in a specific set of segmental contexts which defy traditional explanation. It is shown that /h/ deletes in environments where lower perceptibility is predicted. The results of the perception experiment verify these predictions and further show that language background has a significant impact on speech perception. Finally, this perceptual account of Turkish /h/ deletion points to an empirical means of testing the conflicting hypotheses that perception is active in the synchronic grammar or that its influence is limited to diachrony.
Full Text Available The neural basis of speech perception has been debated for over a century. While it is generally agreed that the superior temporal lobes are critical for the perceptual analysis of speech, a major current topic is whether the motor system contributes to speech perception, with several conflicting findings attested. In a dorsal-ventral speech stream framework (Hickok & Poeppel 2007, this debate is essentially about the roles of the dorsal versus ventral speech processing streams. A major roadblock in characterizing the neuroanatomy of speech perception is task-specific effects. For example, much of the evidence for dorsal stream involvement comes from syllable discrimination type tasks, which have been found to behaviorally doubly dissociate from auditory comprehension tasks (Baker et al. 1981. Discrimination task deficits could be a result of difficulty perceiving the sounds themselves, which is the typical assumption, or it could be a result of failures in temporary maintenance of the sensory traces, or the comparison and/or the decision process. Similar complications arise in perceiving sentences: the extent of inferior frontal (i.e. dorsal stream activation during listening to sentences increases as a function of increased task demands (Love et al. 2006. Another complication is the stimulus: much evidence for dorsal stream involvement uses speech samples lacking semantic context (CVs, non-words. The present study addresses these issues in a large-scale lesion-symptom mapping study. 158 patients with focal cerebral lesions from the Mutli-site Aphasia Research Consortium underwent a structural MRI or CT scan, as well as an extensive psycholinguistic battery. Voxel-based lesion symptom mapping was used to compare the neuroanatomy involved in the following speech perception tasks with varying phonological, semantic, and task loads: (i two discrimination tasks of syllables (non-words and words, respectively, (ii two auditory comprehension tasks
Newcombe, Nora; Arnkoff, Diane B.
Two experiments examined Lakoff's suggestion that men and women use different speech styles (women's speech being more polite and less assertive than men's). The effects of undergraduate students' use of three linguistic variables (tag questions, qualifiers, and compound requests) on person perception was tested. (CM)
Viswanathan, Navin; Magnuson, James S.; Fowler, Carol A.
According to one approach to speech perception, listeners perceive speech by applying general pattern matching mechanisms to the acoustic signal (e.g., Diehl, Lotto, & Holt, 2004). An alternative is that listeners perceive the phonetic gestures that structured the acoustic signal (e.g., Fowler, 1986). The two accounts have offered different…
Full Text Available Audio‐visual speech recognition is a natural and robust approach to improving human-robot interaction in noisy environments. Although multi‐stream Dynamic Bayesian Network and coupled HMM are widely used for audio‐visual speech recognition, they fail to learn the shared features between modalities and ignore the dependency of features among the frames within each discrete state. In this paper, we propose a Deep Dynamic Bayesian Network (DDBN to perform unsupervised extraction of spatial‐temporal multimodal features from Tibetan audio‐visual speech data and build an accurate audio‐visual speech recognition model under a no frame‐independency assumption. The experiment results on Tibetan speech data from some real‐world environments showed the proposed DDBN outperforms the state‐of‐art methods in word recognition accuracy.
Ben-David, Boaz M.; Multani, Namita; Shakuf, Vered; Rudzicz, Frank; van Lieshout, Pascal H. H. M.
Purpose: Our aim is to explore the complex interplay of prosody (tone of speech) and semantics (verbal content) in the perception of discrete emotions in speech. Method: We implement a novel tool, the Test for Rating of Emotions in Speech. Eighty native English speakers were presented with spoken sentences made of different combinations of 5…
Eg, Ragnhild; Behne, Dawn M
In well-controlled laboratory experiments, researchers have found that humans can perceive delays between auditory and visual signals as short as 20 ms. Conversely, other experiments have shown that humans can tolerate audiovisual asynchrony that exceeds 200 ms. This seeming contradiction in human temporal sensitivity can be attributed to a number of factors such as experimental approaches and precedence of the asynchronous signals, along with the nature, duration, location, complexity and repetitiveness of the audiovisual stimuli, and even individual differences. In order to better understand how temporal integration of audiovisual events occurs in the real world, we need to close the gap between the experimental setting and the complex setting of everyday life. With this work, we aimed to contribute one brick to the bridge that will close this gap. We compared perceived synchrony for long-running and eventful audiovisual sequences to shorter sequences that contain a single audiovisual event, for three types of content: action, music, and speech. The resulting windows of temporal integration showed that participants were better at detecting asynchrony for the longer stimuli, possibly because the long-running sequences contain multiple corresponding events that offer audiovisual timing cues. Moreover, the points of subjective simultaneity differ between content types, suggesting that the nature of a visual scene could influence the temporal perception of events. An expected outcome from this type of experiment was the rich variation among participants' distributions and the derived points of subjective simultaneity. Hence, the designs of similar experiments call for more participants than traditional psychophysical studies. Heeding this caution, we conclude that existing theories on multisensory perception are ready to be tested on more natural and representative stimuli.
Gick, Bryan; Jóhannsdóttir, Kristín M.; Gibraiel, Diana; Mühlbauer, Jeff
A single pool of untrained subjects was tested for interactions across two bimodal perception conditions: audio-tactile, in which subjects heard and felt speech, and visual-tactile, in which subjects saw and felt speech. Identifications of English obstruent consonants were compared in bimodal and no-tactile baseline conditions. Results indicate that tactile information enhances speech perception by about 10 percent, regardless of which other mode (auditory or visual) is active. However, within-subject analysis indicates that individual subjects who benefit more from tactile information in one cross-modal condition tend to benefit less from tactile information in the other. PMID:18396924
Fernando Orphão de Carvalho
Full Text Available . In this paper we purport to qualify the claim, advanced by Appelbaum (1999 that speech perception research, in the last 70 years or so, has endorsed a view on the nature of speech for which no evidence can be adduced and which has resisted falsification through active ad hoc “theoretical repair” carried by speech scientists. We show that the author’s qualms on the putative dogmatic status of speech research are utterly unwarranted, if not misconstrued as a whole. On more general grounds, the present article can be understood as a work on the rather underdeveloped area of the philosophy and history of Linguistics.
Syrett, Kristen; Kawahara, Shigeto
In this paper, we ask whether children are sensitive to the needs of their interlocutor, and, if so, whether they - like adults - modify acoustic characteristics of their speech as part of a communicative goal. In a production task, preschoolers participated in a word learning task that favored the use of clear speech. Children produced vowels that were longer, more intense, more dispersed in the vowel space, and had a more expanded F0 range than normal speech. Two perception studies with adults showed that these acoustic differences were perceptible and were used to distinguish normal and clear speech styles. We conclude that preschoolers are sensitive to aspects of the speaker-hearer relationship calling upon them to modify their speech in ways that benefit their listener.
Bogacka, Anna A.
The experiment examined to what acoustic cues Polish learners of English pay attention when distinguishing between English high vowels. Predictions concerned the influence of Polish vowel system (no duration differences and only one vowel in the high back vowel region), salience of duration cues and L1 orthography. Thirty-seven Polish subjects and a control group of English native speakers identified stimuli from heed-hid and who'd-hood continua varying in spectral and duration steps. Identification scores by spectral and duration steps, and F1/F2 plots of identifications, were given as well as fundamental frequency variation comments. English subjects strongly relied on spectral cues (typical categorical perception) and almost did not react to temporal cues. Polish subjects relied strongly on temporal cues for both continua, but showed a reversed pattern of identification of who'd-hood contrast. Their reliance on spectral cues was weak and had a reversed pattern for heed-hid contrast. The results were interpreted with reference to the speech learning model [Flege (1995)], perceptual assimilation model [Best (1995)] and ontogeny phylogeny model [Major (2001)].
Larisa Evgenevna Deryagina
Full Text Available Article is devoted to identification of psychoacoustic differences between languages of Roman, Germanic and Slavic groups, as factors that hinder the learning of foreign languages and EEG-correlates of perception and recognition of foreign speech, as the process of communication. We used theoretico- methodological analysis of psycholinguistic data, psychoacoustic and physiological (own studies. It was determined that the acoustic characteristics of foreign speech affect cerebration and form s of its functioning through the auditory sensory system. Prosodic and articulatory system of the native language has a significant influence on the perception of foreign speech. Patterns of foreign language speech perception are based on different functions of the cerebral hemispheres. Differences in hemispheric organization of the brain can have a significant impact on the effectiveness of learning languages belonging to the Roman, Germanic and Slavic groups, having acoustic and rhythmical-melodic features.
Ingvalson, Erin M; Dhar, Sumitrajit; Wong, Patrick C M; Liu, Hanjun
Working memory capacity has been linked to performance on many higher cognitive tasks, including the ability to perceive speech in noise. Current efforts to train working memory have demonstrated that working memory performance can be improved, suggesting that working memory training may lead to improved speech perception in noise. A further advantage of working memory training to improve speech perception in noise is that working memory training materials are often simple, such as letters or digits, making them easily translatable across languages. The current effort tested the hypothesis that working memory training would be associated with improved speech perception in noise and that materials would easily translate across languages. Native Mandarin Chinese and native English speakers completed ten days of reversed digit span training. Reading span and speech perception in noise both significantly improved following training, whereas untrained controls showed no gains. These data suggest that working memory training may be used to improve listeners' speech perception in noise and that the materials may be quickly adapted to a wide variety of listeners.
Joseph D Crew
Full Text Available Cochlear implant (CI users have difficulty understanding speech in noisy listening conditions and perceiving music. Aided residual acoustic hearing in the contralateral ear can mitigate these limitations. The present study examined contributions of electric and acoustic hearing to speech understanding in noise and melodic pitch perception. Data was collected with the CI only, the hearing aid (HA only, and both devices together (CI+HA. Speech reception thresholds (SRTs were adaptively measured for simple sentences in speech babble. Melodic contour identification (MCI was measured with and without a masker instrument; the fundamental frequency of the masker was varied to be overlapping or non-overlapping with the target contour. Results showed that the CI contributes primarily to bimodal speech perception and that the HA contributes primarily to bimodal melodic pitch perception. In general, CI+HA performance was slightly improved relative to the better ear alone (CI-only for SRTs but not for MCI, with some subjects experiencing a decrease in bimodal MCI performance relative to the better ear alone (HA-only. Individual performance was highly variable, and the contribution of either device to bimodal perception was both subject- and task-dependent. The results suggest that individualized mapping of CIs and HAs may further improve bimodal speech and music perception.
Calcus, Axelle; Lorenzi, Christian; Collet, Gregory; Colin, Cécile; Kolinsky, Régine
Purpose: Children with dyslexia have been suggested to experience deficits in both categorical perception (CP) and speech identification in noise (SIN) perception. However, results regarding both abilities are inconsistent, and the relationship between them is still unclear. Therefore, this study aimed to investigate the relationship between CP…
Most, Tova; Peled, Miriam
This study assessed perception of suprasegmental features of speech by 30 prelingual children with sensorineural hearing loss. Ten children had cochlear implants (CIs), and 20 children wore hearing aids (HA): 10 with severe hearing loss and 10 with profound hearing loss. Perception of intonation, syllable stress, word emphasis, and word pattern…
Full Text Available Synesthesia entails a special kind of sensory perception, where stimulation in one sensory modality leads to an internally generated perceptual experience of another, not stimulated sensory modality. This phenomenon can be viewed as an abnormal multisensory integration process as here the synesthetic percept is aberrantly fused with the stimulated modality. Indeed, recent synesthesia research has focused on multimodal processing even outside of the specific synesthesia-inducing context and has revealed changed multimodal integration, thus suggesting perceptual alterations at a global level. Here, we focused on audio-visual processing in synesthesia using a semantic classification task in combination with visually or auditory-visually presented animated and inanimated objects in an audio-visual congruent and incongruent manner. Fourteen subjects with auditory-visual and/or grapheme-color synesthesia and 14 control subjects participated in the experiment. During presentation of the stimuli, event-related potentials were recorded from 32 electrodes. The analysis of reaction times and error rates revealed no group differences with best performance for audio-visually congruent stimulation indicating the well-known multimodal facilitation effect. We found an enhanced amplitude of the N1 component over occipital electrode sites for synesthetes compared to controls. The differences occurred irrespective of the experimental condition and therefore suggest a global influence on early sensory processing in synesthetes.
Konijn, Elly A.; Walma van der Molen, Juliette H.; van Nes, Sander
This study investigated whether emotions induced in TV-viewers (either as an emotional state or co-occurring with emotional involvement) would increase viewers' perception of realism in a fake documentary and affect the information value that viewers would attribute to its content. To that end, two experiments were conducted that manipulated (a)…
Full Text Available Objective: To determine speech-perception-in-noise (with speech and noise spatially distinct and coincident and bilateral spatial benefits of head-shadow effect, summation, squelch and spatial release of masking in adults with delayed sequential cochlear implants. Study design: A cross-sectional one group post-test-only exploratory design was employed. Eleven adults (mean age 47 years; range 21 – 69 years of the Pretoria Cochlear Implant Programme (PCIP in South Africa with a bilateral severe-to-profound sensorineural hearing loss were recruited. Prerecorded Everyday Speech Sentences of The Central Institute for the Deaf (CID were used to evaluate participants’ speech-in-noise perception at sentence level. An adaptive procedure was used to determine the signal-to-noise ratio (SNR, in dB at which the participant’s speech reception threshold (SRT was achieved. Specific calculations were used to estimate bilateral spatial benefit effects. Results: A minimal bilateral benefit for speech-in-noise perception was observed with noise directed to the first implant (CI 1 (1.69 dB and in the speech and noise spatial listening condition (0.78 dB, but was not statistically significant. The head-shadow effect at 180° was the most robust bilateral spatial benefit. An improvement in speech perception in spatially distinct speech and noise indicates the contribution of the second implant (CI 2 is greater than that of the first implant (CI 1 for bilateral spatial benefit. Conclusion: Bilateral benefit for delayed sequentially implanted adults is less than previously reported for simultaneous and sequentially implanted adults. Delayed sequential implantation benefit seems to relate to the availability of the ear with the most favourable SNR.
Full Text Available For deaf individuals with residual low-frequency acoustic hearing, combined use of a cochlear implant (CI and hearing aid (HA typically provides better speech understanding than with either device alone. Because of coarse spectral resolution, CIs do not provide fundamental frequency (F0 information that contributes to understanding of tonal languages such as Mandarin Chinese. The HA can provide good representation of F0 and, depending on the range of aided acoustic hearing, first and second formant (F1 and F2 information. In this study, Mandarin tone, vowel, and consonant recognition in quiet and noise was measured in 12 adult Mandarin-speaking bimodal listeners with the CI-only and with the CI+HA. Tone recognition was significantly better with the CI+HA in noise, but not in quiet. Vowel recognition was significantly better with the CI+HA in quiet, but not in noise. There was no significant difference in consonant recognition between the CI-only and the CI+HA in quiet or in noise. There was a wide range in bimodal benefit, with improvements often greater than 20 percentage points in some tests and conditions. The bimodal benefit was compared to CI subjects' HA-aided pure-tone average (PTA thresholds between 250 and 2000 Hz; subjects were divided into two groups: "better" PTA (50 dB HL. The bimodal benefit differed significantly between groups only for consonant recognition. The bimodal benefit for tone recognition in quiet was significantly correlated with CI experience, suggesting that bimodal CI users learn to better combine low-frequency spectro-temporal information from acoustic hearing with temporal envelope information from electric hearing. Given the small number of subjects in this study (n = 12, further research with Chinese bimodal listeners may provide more information regarding the contribution of acoustic and electric hearing to tonal language perception.
Full Text Available The aim of the study was to assess the effect of short-term musical training on speech perception in noise. In the present study speech perception in noise was measured pre- and post- short-term musical training. The musical training involved auditory perceptual training for raga identification of two Carnatic ragas. The training was given for eight sessions. A total of 18 normal hearing adults in the age range of 18-25 years participated in the study wherein group 1 consisted of ten individuals who underwent musical training and group 2 consisted of eight individuals who did not undergo any training. Results revealed that post training, speech perception in noise improved significantly in group 1, whereas group 2 did not show any changes in speech perception scores. Thus, short-term musical training shows an enhancement of speech perception in the presence of noise. However, generalization and long-term maintenance of these benefits needs to be evaluated.
The general topic addressed by this dissertation is that of bilingualism, and more specifically, the topic of bilingual acquisition of speech sounds. The central question in this study is the following: does bilingualism affect children’s perceptual development of speech sounds? The term bilingual i
Scott, Sophie K.; Wise, Richard J. S.
In this paper we attempt to relate the prelexical processing of speech, with particular emphasis on functional neuroimaging studies, to the study of auditory perceptual systems by disciplines in the speech and hearing sciences. The elaboration of the sound-to-meaning pathways in the human brain enables their integration into models of the human…
Drullman, Rob; Bronkhorst, Adelbert W.
Speech intelligibility was investigated by varying the number of interfering talkers, level, and mean pitch differences between target and interfering speech, and the presence of tactile support. In a first experiment the speech-reception threshold (SRT) for sentences was measured for a male talker against a background of one to eight interfering male talkers or speech noise. Speech was presented diotically and vibro-tactile support was given by presenting the low-pass-filtered signal (0-200 Hz) to the index finger. The benefit in the SRT resulting from tactile support ranged from 0 to 2.4 dB and was largest for one or two interfering talkers. A second experiment focused on masking effects of one interfering talker. The interference was the target talker's own voice with an increased mean pitch by 2, 4, 8, or 12 semitones. Level differences between target and interfering speech ranged from -16 to +4 dB. Results from measurements of correctly perceived words in sentences show an intelligibility increase of up to 27% due to tactile support. Performance gradually improves with increasing pitch difference. Louder target speech generally helps perception, but results for level differences are considerably dependent on pitch differences. Differences in performance between noise and speech maskers and between speech maskers with various mean pitches are explained by the effect of informational masking. .
Full Text Available It has traditionally been assumed that cochlear implant users de facto perform atypically in audiovisual tasks. However, a recent study that combined an auditory task with visual distractors suggests that only those cochlear implant users that are not proficient at recognizing speech sounds might show abnormal audiovisual interactions. The present study aims at reinforcing this notion by investigating the audiovisual segregation abilities of cochlear implant users in a visual task with auditory distractors. Speechreading was assessed in two groups of cochlear implant users (proficient and non-proficient at sound recognition, as well as in normal controls. A visual speech recognition task (i.e. speechreading was administered either in silence or in combination with three types of auditory distractors: i noise ii reverse speech sound and iii non-altered speech sound. Cochlear implant users proficient at speech recognition performed like normal controls in all conditions, whereas non-proficient users showed significantly different audiovisual segregation patterns in both speech conditions. These results confirm that normal-like audiovisual segregation is possible in highly skilled cochlear implant users and, consequently, that proficient and non-proficient CI users cannot be lumped into a single group. This important feature must be taken into account in further studies of audiovisual interactions in cochlear implant users.
Rogalsky, Corianne; Love, Tracy; Driscoll, David; Anderson, Steven W; Hickok, Gregory
The discovery of mirror neurons in macaque has led to a resurrection of motor theories of speech perception. Although the majority of lesion and functional imaging studies have associated perception with the temporal lobes, it has also been proposed that the 'human mirror system', which prominently includes Broca's area, is the neurophysiological substrate of speech perception. Although numerous studies have demonstrated a tight link between sensory and motor speech processes, few have directly assessed the critical prediction of mirror neuron theories of speech perception, namely that damage to the human mirror system should cause severe deficits in speech perception. The present study measured speech perception abilities of patients with lesions involving motor regions in the left posterior frontal lobe and/or inferior parietal lobule (i.e., the proposed human 'mirror system'). Performance was at or near ceiling in patients with fronto-parietal lesions. It is only when the lesion encroaches on auditory regions in the temporal lobe that perceptual deficits are evident. This suggests that 'mirror system' damage does not disrupt speech perception, but rather that auditory systems are the primary substrate for speech perception.
Dole, Marjorie; Hoen, Michel; Meunier, Fanny
Developmental dyslexia is associated with impaired speech-in-noise perception. The goal of the present research was to further characterize this deficit in dyslexic adults. In order to specify the mechanisms and processing strategies used by adults with dyslexia during speech-in-noise perception, we explored the influence of background type,…
Full Text Available The present study examined to what extent proficiency in a non-native language influences speech perception in noise. We explored how English proficiency affected native (Swedish and non-native (English speech perception in four speech reception threshold (SRT conditions including two energetic (stationary, fluctuating noise and two informational (two-talker babble Swedish, two-talker babble English maskers. Twenty-three normal-hearing native Swedish listeners participated, age between 28 and 64 years. The participants also performed standardized tests in English proficiency, non-verbal reasoning and working memory capacity. Our approach with focus on proficiency and the assessment of external as well as internal, listener-related factors allowed us to examine which variables explained intra-and interindividual differences in native and non-native speech perception performance. The main result was that in the non-native target, the level of English proficiency is a decisive factor for speech intelligibility in noise. High English proficiency improved performance in all four conditions when target language was English. The informational maskers were interfering more with perception than energetic maskers, specifically in the non-native language. The study also confirmed that the SRT's were better when target language was native compared to non-native.
Gick, Bryan; Ikegami, Yoko; Derrick, Donald
Asynchronous cross-modal information is integrated asymmetrically in audio-visual perception. To test whether this asymmetry generalizes across modalities, auditory (aspirated “pa” and unaspirated “ba” stops) and tactile (slight, inaudible, cutaneous air puffs) signals were presented synchronously and asynchronously. Results were similar to previous AV studies: the temporal window of integration for the enhancement effect (but not the interference effect) was asymmetrical, allowing up to 200 ms of asynchrony when the puff followed the audio signal, but only up to 50 ms when the puff preceded the audio signal. These findings suggest that perceivers accommodate differences in physical transmission speed of different multimodal signals. PMID:21110549
You, R S; Serniclaes, W; Rider, D; Chabane, N
Previous studies have claimed to show deficits in the perception of speech sounds in autism spectrum disorders (ASD). The aim of the current study was to clarify the nature of such deficits. Children with ASD might only exhibit a lesser amount of precision in the perception of phoneme categories (CPR deficit). However, these children might further present an allophonic mode of speech perception, similar to the one evidenced in dyslexia, characterised by enhanced discrimination of acoustic differences within phoneme categories. Allophonic perception usually gives rise to a categorical perception (CP) deficit, characterised by a weaker coherence between discrimination and identification of speech sounds. The perceptual performance of ASD children was compared to that of control children of the same chronological age. Identification and discrimination data were collected for continua of natural vowels, synthetic vowels, and synthetic consonants. Results confirmed that children with ASD exhibit a CPR deficit for the three stimulus continua. These children further exhibited a trend toward allophonic perception that was, however, not accompanied by the usual CP deficit. These findings confirm that the commonly found CPR deficit is also present in ASD. Whether children with ASD also present allophonic perception requires further investigations.
Jarick, Michelle; Jones, Jeffery A
Research demonstrates that listening to and viewing speech excites tongue and lip motor areas involved in speech production. This perceptual-motor relationship was investigated behaviourally by presenting video clips of a speaker producing vowel-consonant-vowel syllables in three conditions: visual-only, audio-only, and audiovisual. Participants identified target letters that were flashed over the mouth during the video, either manually or verbally as quickly as possible. Verbal responses were fastest when the target matched the speech stimuli in all modality conditions, yet optimal facilitation was observed when participants were presented with visual-only stimuli. Critically, no such facilitation occurred when participants were asked to identify the target manually. Our findings support previous research suggesting a close relationship between speech perception and production by demonstrating that viewing speech can 'prime' our motor system for subsequent speech production.
Cristia, A.; Seidl, A.; Junge, C.M.M.; Soderstrom, M.; Hagoort, P.
There are increasing reports that individual variation in behavioral and neurophysiological measures of infant speech processing predicts later language outcomes, and specifically concurrent or subsequent vocabulary size. If such findings are held up under scrutiny, they could both illuminate theore
Davis, Matthew H.; Coleman, Martin R.; Absalom, Anthony R.; Rodd, Jennifer M.; Johnsrude, Ingrid S.; Matta, Basil F.; Owen, Adrian M.; Menon, David K.
We used functional MRI and the anesthetic agent propofol to assess the relationship among neural responses to speech, successful comprehension, and conscious awareness. Volunteers were scanned while listening to sentences containing ambiguous words, matched sentences without ambiguous words, and sig
A. Christia; A. Seidl; C. Junge; M. Soderstrom; P. Hagoort
There are increasing reports that individual variation in behavioral and neurophysiological measures of infant speech processing predicts later language outcomes, and specifically concurrent or subsequent vocabulary size. If such findings are held up under scrutiny, they could both illuminate theore
Osnes, Berge; Hugdahl, Kenneth; Specht, Karsten
Several reports of premotor cortex involvement in speech perception have been put forward. Still, the functional role of premotor cortex is under debate. In order to investigate the functional role of premotor cortex, we presented parametrically varied speech stimuli in both a behavioral and functional magnetic resonance imaging (fMRI) study. White noise was transformed over seven distinct steps into a speech sound and presented to the participants in a randomized order. As control condition served the same transformation from white noise into a music instrument sound. The fMRI data were modelled with Dynamic Causal Modeling (DCM) where the effective connectivity between Heschl's gyrus, planum temporale, superior temporal sulcus and premotor cortex were tested. The fMRI results revealed a graded increase in activation in the left superior temporal sulcus. Premotor cortex activity was only present at an intermediate step when the speech sounds became identifiable but were still distorted but was not present when the speech sounds were clearly perceivable. A Bayesian model selection procedure favored a model that contained significant interconnections between Heschl's gyrus, planum temporal, and superior temporal sulcus when processing speech sounds. In addition, bidirectional connections between premotor cortex and superior temporal sulcus and from planum temporale to premotor cortex were significant. Processing non-speech sounds initiated no significant connections to premotor cortex. Since the highest level of motor activity was observed only when processing identifiable sounds with incomplete phonological information, it is concluded that premotor cortex is not generally necessary for speech perception but may facilitate interpreting a sound as speech when the acoustic input is sparse.
Astheimer, Lori B.; Berkes, Matthias; Bialystok, Ellen
Attention is required during speech perception to focus processing resources on critical information. Previous research has shown that bilingualism modifies attentional processing in nonverbal domains. The current study used event-related potentials (ERPs) to determine whether bilingualism also modifies auditory attention during speech perception. We measured attention to word onsets in spoken English for monolinguals and Chinese-English bilinguals. Auditory probes were inserted at four times in a continuous narrative: concurrent with word onset, 100 ms before or after onset, and at random control times. Greater attention was indexed by an increase in the amplitude of the early negativity (N1). Among monolinguals, probes presented after word onsets elicited a larger N1 than control probes, replicating previous studies. For bilinguals, there was no N1 difference for probes at different times around word onsets, indicating less specificity in allocation of attention. These results suggest that bilingualism shapes attentional strategies during English speech comprehension. PMID:27110579
Blood, Gordon W.; Boyle, Michael P.; Blood, Ingrid M.; Nalesnik, Gina R.
Bullying in school-age children is a global epidemic. School personnel play a critical role in eliminating this problem. The goals of this study were to examine speech-language pathologists' (SLPs) perceptions of bullying, endorsement of potential strategies for dealing with bullying, and associations among SLPs' responses and specific demographic…
Schomers, Malte R.; Pulvermüller, Friedemann
In the neuroscience of language, phonemes are frequently described as multimodal units whose neuronal representations are distributed across perisylvian cortical regions, including auditory and sensorimotor areas. A different position views phonemes primarily as acoustic entities with posterior temporal localization, which are functionally independent from frontoparietal articulatory programs. To address this current controversy, we here discuss experimental results from functional magnetic resonance imaging (fMRI) as well as transcranial magnetic stimulation (TMS) studies. On first glance, a mixed picture emerges, with earlier research documenting neurofunctional distinctions between phonemes in both temporal and frontoparietal sensorimotor systems, but some recent work seemingly failing to replicate the latter. Detailed analysis of methodological differences between studies reveals that the way experiments are set up explains whether sensorimotor cortex maps phonological information during speech perception or not. In particular, acoustic noise during the experiment and ‘motor noise’ caused by button press tasks work against the frontoparietal manifestation of phonemes. We highlight recent studies using sparse imaging and passive speech perception tasks along with multivariate pattern analysis (MVPA) and especially representational similarity analysis (RSA), which succeeded in separating acoustic-phonological from general-acoustic processes and in mapping specific phonological information on temporal and frontoparietal regions. The question about a causal role of sensorimotor cortex on speech perception and understanding is addressed by reviewing recent TMS studies. We conclude that frontoparietal cortices, including ventral motor and somatosensory areas, reflect phonological information during speech perception and exert a causal influence on language understanding. PMID:27708566
Studdert-Kennedy, Michael; Mody, Maria; Brady, Susan
This rejoinder to a critique of the authors' research on speech perception deficits in poor readers answers the specific criticisms and reaffirms their conclusion that the difficulty some poor readers have with rapid /ba/-/da/ discrimination does not stem from difficulty in discriminating the rapid spectral transitions at stop-vowel syllable…
Chi Yhun Lo
Full Text Available Cochlear implant (CI recipients generally have good perception of speech in quiet environments but difficulty perceiving speech in noisy conditions, reduced sensitivity to speech prosody, and difficulty appreciating music. Auditory training has been proposed as a method of improving speech perception for CI recipients, and recent efforts have focussed on the potential benefits of music-based training. This study evaluated two melodic contour training programs and their relative efficacy as measured on a number of speech perception tasks. These melodic contours were simple 5-note sequences formed into 9 contour patterns, such as “rising” or “rising-falling.” One training program controlled difficulty by manipulating interval sizes, the other by note durations. Sixteen adult CI recipients (aged 26–86 years and twelve normal hearing (NH adult listeners (aged 21–42 years were tested on a speech perception battery at baseline and then after 6 weeks of melodic contour training. Results indicated that there were some benefits for speech perception tasks for CI recipients after melodic contour training. Specifically, consonant perception in quiet and question/statement prosody was improved. In comparison, NH listeners performed at ceiling for these tasks. There was no significant difference between the posttraining results for either training program, suggesting that both conferred benefits for training CI recipients to better perceive speech.
Full Text Available A recent opinion article (Neural oscillations in speech: don’t be enslaved by the envelope. Obleser et al., 2012 questions the validity of a class of speech perception models inspired by the possible role of neuronal oscillations in decoding speech (e.g., Ghitza 2011, Giraud & Poeppel 2012. They criticize, in particular, what they see as the over-emphasis of the role of temporal speech envelope information, and the over-emphasis of entrainment to the input rhythm while neglecting the role of top-down processes in modulating the entrainment of neuronal oscillations. Here we respond to these arguments, referring to the phenomenological model of Ghitza (2011, taken as a representative of the criticized approach.
García, Paula B; Rosado Rogers, Lydia; Nishi, Kanae
This study evaluated the English version of Computer-Assisted Speech Perception Assessment (E-CASPA) with Spanish-English bilingual children. E-CASPA has been evaluated with monolingual English speakers ages 5 years and older, but it is unknown whether a separate norm is necessary for bilingual children. Eleven Spanish-English bilingual and 12 English monolingual children (6 to 12 years old) with normal hearing participated. Responses were scored by word, phoneme, consonant, and vowel. Regardless of scores, performance across three signal-to-noise ratio conditions was similar between groups, suggesting that the same norm can be used for both bilingual and monolingual children.
Knowland, Victoria C. P.; Evans, Sam; Snell, Caroline; Rosen, Stuart
Purpose: The purpose of the study was to assess the ability of children with developmental language learning impairments (LLIs) to use visual speech cues from the talking face. Method: In this cross-sectional study, 41 typically developing children (mean age: 8 years 0 months, range: 4 years 5 months to 11 years 10 months) and 27 children with…
Sadakata, M.; Zanden, L.D.T. van der; Sekiyama, K.
The current study reports specific cases in which a positive transfer of perceptual ability from the music domain to the language domain occurs. We tested whether musical training enhances discrimination and identification performance of L2 speech sounds (timing features, nasal consonants and vowels
Garcia Lecumberri, M.L.; Cooke, M.P.; Cutler, A.
If listening in adverse conditions is hard, then listening in a foreign language is doubly so: non-native listeners have to cope with both imperfect signals and imperfect knowledge. Comparison of native and non-native listener performance in speech-in-noise tasks helps to clarify the role of prior l
Caldwell, Amanda; Nittrouer, Susan
Purpose: Common wisdom suggests that listening in noise poses disproportionately greater difficulty for listeners with cochlear implants (CIs) than for peers with normal hearing (NH). The purpose of this study was to examine phonological, language, and cognitive skills that might help explain speech-in-noise abilities for children with CIs.…
Approaches to music and audiovisual meaning in film appear to be very different in nature and scope when considered from the point of view of experimental psychology or humanistic studies. Nevertheless, this article argues that experimental studies square with ideas of audiovisual perception and ...
McMurray, Bob; Kovack-Lesh, Kristine A; Goodwin, Dresden; McEchron, William
Infant directed speech (IDS) is a speech register characterized by simpler sentences, a slower rate, and more variable prosody. Recent work has implicated it in more subtle aspects of language development. Kuhl et al. (1997) demonstrated that segmental cues for vowels are affected by IDS in a way that may enhance development: the average locations of the extreme "point" vowels (/a/, /i/ and /u/) are further apart in acoustic space. If infants learn speech categories, in part, from the statistical distributions of such cues, these changes may specifically enhance speech category learning. We revisited this by asking (1) if these findings extend to a new cue (Voice Onset Time, a cue for voicing); (2) whether they extend to the interior vowels which are much harder to learn and/or discriminate; and (3) whether these changes may be an unintended phonetic consequence of factors like speaking rate or prosodic changes associated with IDS. Eighteen caregivers were recorded reading a picture book including minimal pairs for voicing (e.g., beach/peach) and a variety of vowels to either an adult or their infant. Acoustic measurements suggested that VOT was different in IDS, but not in a way that necessarily supports better development, and that these changes are almost entirely due to slower rate of speech of IDS. Measurements of the vowel suggested that in addition to changes in the mean, there was also an increase in variance, and statistical modeling suggests that this may counteract the benefit of any expansion of the vowel space. As a whole this suggests that changes in segmental cues associated with IDS may be an unintended by-product of the slower rate of speech and different prosodic structure, and do not necessarily derive from a motivation to enhance development.
GARCIA, Tatiana Manfrini; JACOB, Regina Tangerino de Souza; MONDELLI, Maria Fernanda Capoani Garcia
ABSTRACT Objective To relate the performance of individuals with hearing loss at high frequencies in speech perception with the quality of life before and after the fitting of an open-fit hearing aid (HA). Methods The WHOQOL-BREF had been used before the fitting and 90 days after the use of HA. The Hearing in Noise Test (HINT) had been conducted in two phases: (1) at the time of fitting without an HA (situation A) and with an HA (situation B); (2) with an HA 90 days after fitting (situation C). Study Sample Thirty subjects with sensorineural hearing loss at high frequencies. Results By using an analysis of variance and the Tukey’s test comparing the three HINT situations in quiet and noisy environments, an improvement has been observed after the HA fitting. The results of the WHOQOL-BREF have showed an improvement in the quality of life after the HA fitting (paired t-test). The relationship between speech perception and quality of life before the HA fitting indicated a significant relationship between speech recognition in noisy environments and in the domain of social relations after the HA fitting (Pearson’s correlation coefficient). Conclusions The auditory stimulation has improved speech perception and the quality of life of individuals. PMID:27383708
Stevenson, Ryan A; Segers, Magali; Ferber, Susanne; Barense, Morgan D; Camarata, Stephen; Wallace, Mark T
A growing area of interest and relevance in the study of autism spectrum disorder (ASD) focuses on the relationship between multisensory temporal function and the behavioral, perceptual, and cognitive impairments observed in ASD. Atypical sensory processing is becoming increasingly recognized as a core component of autism, with evidence of atypical processing across a number of sensory modalities. These deviations from typical processing underscore the value of interpreting ASD within a multisensory framework. Furthermore, converging evidence illustrates that these differences in audiovisual processing may be specifically related to temporal processing. This review seeks to bridge the connection between temporal processing and audiovisual perception, and to elaborate on emerging data showing differences in audiovisual temporal function in autism. We also discuss the consequence of such changes, the specific impact on the processing of different classes of audiovisual stimuli (e.g. speech vs. nonspeech, etc.), and the presumptive brain processes and networks underlying audiovisual temporal integration. Finally, possible downstream behavioral implications, and possible remediation strategies are outlined. Autism Res 2016, 9: 720-738. © 2015 International Society for Autism Research, Wiley Periodicals, Inc.
Mitterer, H.; McQueen, J.M.
Two experiments examined how Dutch listeners deal with the effects of connected-speech processes, specifically those arising from word-final /t/ reduction (e.g., whether Dutch [tas] is tax, bag, or a reduced-/t/ version of last, touch). Eye movements of Dutch participants were tracked as they looked
Mitterer, Holger; McQueen, James M.
Two experiments examined how Dutch listeners deal with the effects of connected-speech processes, specifically those arising from word-final /t/ reduction (e.g., whether Dutch [tas] is "tas," bag, or a reduced-/t/ version of "tast," touch). Eye movements of Dutch participants were tracked as they looked at arrays containing 4…
Farhood, Zachary; Nguyen, Shaun A; Miller, Stephen C; Holcomb, Meredith A; Meyer, Ted A; Rizk, And Habib G
Objective (1) To analyze reported speech perception outcomes in patients with inner ear malformations who undergo cochlear implantation, (2) to review the surgical complications and findings, and (3) to compare the 2 classification systems of Jackler and Sennaroglu. Data Sources PubMed, Scopus (including Embase), Medline, and CINAHL Plus. Review Methods Fifty-nine articles were included that contained speech perception and/or intraoperative data. Cases were differentiated depending on whether the Jackler or Sennaroglu malformation classification was used. A meta-analysis of proportions examined incidences of complete insertion, gusher, and facial nerve aberrancy. For speech perception data, weighted means and standard deviations were calculated for all malformations for short-, medium-, and long-term follow-up. Speech tests were grouped into 3 categories-closed-set words, open-set words, and open-set sentences-and then compared through a comparison-of-means t test. Results Complete insertion was seen in 81.8% of all inner ear malformations (95% CI: 72.6-89.5); gusher was reported in 39.1% of cases (95% CI: 30.3-48.2); and facial nerve anomalies were encountered in 34.4% (95% CI: 20.1-50.3). Significant improvements in average performance were seen for closed- and open-set tests across all malformation types at 12 months postoperatively. Conclusions Cochlear implantation outcomes are favorable for those with inner ear malformations from a surgical and speech outcome standpoint. Accurate classification of anatomic malformations, as well as standardization of postimplantation speech outcomes, is necessary to improve understanding of the impact of implantation in this difficult patient population.
Jianwu Dang; Masato Akagi; Kiyoshi Honda
Realization of an intelligent human-machine interface requires us to investigate human mechanisms and learn from them. This study focuses on communication between speech production and perception within human brain and realizing it in an artificial system. A physiological research study based on electromyographic signals (Honda, 1996) suggested that speech communication in human brain might be based on a topological mapping between speech production and perception, according to an analogous topology between motor and sensory representations. Following this hypothesis, this study first investigated the topologies of the vowel system across the motor, kinematic, and acoustic spaces by means of a model simulation, and then examined the linkage between vowel production and perception in terms of a transformed auditory feedback (TAF) experiment. The model simulation indicated that there exists an invariant mapping from muscle activations (motor space) to articulations (kinematic space) via a coordinate consisting of force-dependent equilibrium positions, and the mapping from the motor space to kinematic space is unique. The motor-kinematic-acoustic deduction in the model simulation showed that the topologies were compatible from one space to another. In the TAF experiment, vowel production exhibited a compensatory response for a perturbation in the feedback sound. This implied that vowel production is controlled in reference to perception monitoring.
Whistled speech is a little studied local use of language shaped by several cultures of the world either for distant dialogues or for rendering traditional songs. This practice consists of an emulation of the voice thanks to a simple modulated pitch. It is therefore the result of a transformation of the vocal signal that implies simplifications in the frequency domain. The whistlers adapt their productions to the way each language combines the qualities of height perceived simultaneously by the human ear in the complex frequency spectrum of the spoken or sung voice (pitch, timbre). As a consequence, this practice underlines key acoustic cues for the intelligibility of the concerned languages. The present study provides an analysis of the acoustic and phonetic features selected by whistled speech in several traditions either in purely oral whistles (Spanish, Turkish, Mazatec) or in whistles produced with an instrument like a leaf (Akha, Hmong). It underlines the convergences with the strategies of the singing ...
Tanta, Ivan; Lesinger, Gordana
Key place in this paper takes the study of political speech in the Republic of Croatia and their impact on voters, or which keywords are in political speeches and public appearances of politicians in Croatia that their voting body wants to hear. Given listed below we will define the research topic in the form of a question - is there a discrepancy in the perception of the public-political speech in Croatia, and which keywords are specific to the two main regions in Croatia and that inhabitant these regions respond. Marcus Tullius Cicero, the most important Roman orator, he used a specific associative mnemonic technique that is called "technique room". He would talk expound on keywords and conceptual terms that he needed for the desired topic and join in these make them, according to the desired order, in a very creative and unique way, the premises of the house or palace, which he knew well. Then, while holding the speech intended to pass through rooms of the house or palace and then put keywords and concepts come to mind, again according to the desired order. Given that this is a specific kind of research political speech that is relatively recent in Croatia, it should be noted that there is still, this kind of political communication is not sufficiently explored. Particularly the emphasis on the impact and use of keywords specific to the Republic of Croatia, in everyday public and political communication. The paper will be analyzed the political, campaign speeches and promises several winning candidates, and now Croatian MEPs, specific keywords related to: economics, culture, science, education and health. The analysis is based on comparison of the survey results on the representation of key words in the speeches of politicians and qualitative analysis of the speeches of politicians on key words during the election campaign.
Bidelman, Gavin M
Musical training is associated with behavioral and neurophysiological enhancements in auditory processing for both musical and nonmusical sounds (e.g., speech). Yet, whether the benefits of musicianship extend beyond enhancements to auditory-specific skills and impact multisensory (e.g., audiovisual) processing has yet to be fully validated. Here, we investigated multisensory integration of auditory and visual information in musicians and nonmusicians using a double-flash illusion, whereby the presentation of multiple auditory stimuli (beeps) concurrent with a single visual object (flash) induces an illusory perception of multiple flashes. We parametrically varied the onset asynchrony between auditory and visual events (leads and lags of ±300 ms) to quantify participants' "temporal window" of integration, i.e., stimuli in which auditory and visual cues were fused into a single percept. Results show that musically trained individuals were both faster and more accurate at processing concurrent audiovisual cues than their nonmusician peers; nonmusicians had a higher susceptibility for responding to audiovisual illusions and perceived double flashes over an extended range of onset asynchronies compared to trained musicians. Moreover, temporal window estimates indicated that musicians' windows (audiovisual binding. Collectively, findings indicate a more refined binding of auditory and visual cues in musically trained individuals. We conclude that experience-dependent plasticity of intensive musical experience extends beyond simple listening skills, improving multimodal processing and the integration of multiple sensory systems in a domain-general manner.
David E Jenson
Full Text Available Sensorimotor integration within the dorsal stream enables online monitoring of speech. Jenson et al. (2014 used independent component analysis (ICA and event related spectral perturbation (ERSP analysis of EEG data to describe anterior sensorimotor (e.g., premotor cortex; PMC activity during speech perception and production. The purpose of the current study was to identify and temporally map neural activity from posterior (i.e., auditory regions of the dorsal stream in the same tasks. Perception tasks required ‘active’ discrimination of syllable pairs (/ba/ and /da/ in quiet and noisy conditions. Production conditions required overt production of syllable pairs and nouns. ICA performed on concatenated raw 68 channel EEG data from all tasks identified bilateral ‘auditory’ alpha (α components in 15 of 29 participants localized to pSTG (left and pMTG (right. ERSP analyses were performed to reveal fluctuations in the spectral power of the α rhythm clusters across time. Production conditions were characterized by significant α event related synchronization (ERS; pFDR < .05 concurrent with EMG activity from speech production, consistent with speech-induced auditory inhibition. Discrimination conditions were also characterized by α ERS following stimulus offset. Auditory α ERS in all conditions also temporally aligned with PMC activity reported in Jenson et al. (2014. These findings are indicative of speech-induced suppression of auditory regions, possibly via efference copy. The presence of the same pattern following stimulus offset in discrimination conditions suggests that sensorimotor contributions following speech perception reflect covert replay, and that covert replay provides one source of the motor activity previously observed in some speech perception tasks. To our knowledge, this is the first time that inhibition of auditory regions by speech has been observed in real-time with the ICA/ERSP technique.
Foote, Jennifer A; Trofimovich, Pavel
Second language speech learning is predicated on learners' ability to notice differences between their own language output and that of their interlocutors. Because many learners interact primarily with other second language users, it is crucial to understand which dimensions underlie the perception of second language speech by learners, compared to native speakers. For this study, 15 non-native and 10 native English speakers rated 30-s language audio-recordings from controlled reading and interview tasks for dissimilarity, using all pairwise combinations of recordings. PROXSCAL multidimensional scaling analyses revealed fluency and aspects of speakers' pronunciation as components underlying listener judgments but showed little agreement across listeners. Results contribute to an understanding of why second language speech learning is difficult and provide implications for language training.
Feijoo, Sergio; Fernandez, Santiago; Alvarez, Jose Manuel
The combined effects of excessive ambient noise and reverberation in classrooms interfere with speech recognition and tend to degrade the learning process of young children. This paper reports a detailed analysis of a speech recognition test carried out with two different children populations of ages 8-9 and 10-11. Unlike English, Spanish has few minimal pairs to be used for phoneme recognition in a closed set manner. The test consisted in a series of two-syllable nonsense words formed by the combination of all possible syllables in Spanish. The test was administered to the children as a dictation task in which they had to write down the words spoken by their female teacher. The test was administered in two blocks on different days, and later repeated to analyze its consistency. The rationale for this procedure was (a) the test should reproduce normal academic situations, (b) all phonological and lexical context effects should be avoided, (c) errors in both words and phonemes should be scored to unveil any possible acoustic base for them. Although word recognition scores were similar among age groups and repetitions, phoneme errors showed high variability questioning the validity of such a test for classroom assessment.
Padgitt, Noelle R.; Munson, Benjamin; Carney, Edward J.
University instruction in phonetics requires students to associate a set of quasialphabetic symbols and diacritics with speech sounds. In the case of narrow phonetic transcription, students are required to associate symbols with sounds that do not function contrastively in the language. This learning task is challenging, given that students must discriminate among different variants of sounds that are not used to convey differences in lexical meaning. Consequently, many students fail to learn phonetic transcription to a level of proficiency needed for practical application (B. Munson and K. N. Brinkman, Am. J. Speech Lang. Path. ). In an effort to improve students' phonetic transcription skills, a computerized training program was developed to trains students' discrimination and identification of selected phonetic contrasts. The design of the training tool was based on similar tools that have been used to train phonetic contrasts in second-language learners of English (e.g., A. Bradlow et al., J. Acoust. Soc. Am. 102, 3115 ). It consists of multiple stages (bombardment, discrimination, identification) containing phonetic contrasts that students have identified as particularly difficult to perceive. This presentation will provide a demonstration of the training tool, and will present preliminary data on the efficacy of this tool in improving students' phonetic transcription abilities.
Tong, Xiuhong; McBride, Catherine; Lee, Chia-Ying; Zhang, Juan; Shuai, Lan; Maurer, Urs; Chung, Kevin K H
Using a multiple-deviant oddball paradigm, this study examined second graders' brain responses to Cantonese speech. We aimed to address the question of whether a change in a consonant or lexical tone could be automatically detected by children. We measured auditory mismatch responses to place of articulation and voice onset time (VOT), reflecting segmental perception, as well as Cantonese lexical tones including level tone and contour tone, reflecting suprasegmental perception. The data showed that robust mismatch negativities (MMNs) were elicited by all deviants in the time window of 300-500 ms in second graders. Moreover, relative to the standard stimuli, the VOT deviant elicited a robust positive mismatch response, and the level tone deviant elicited a significant MMN in the time window of 150-300 ms. The findings suggest that Hong Kong second graders were sensitive to neural discriminations of speech sounds both at the segmental and suprasegmental levels.
Parviainen, Tiina; Helenius, Päivi; Poskiparta, Elisa; Niemi, Pekka; Salmelin, Riitta
Speech processing skills go through intensive development during mid-childhood, providing basis also for literacy acquisition. The sequence of auditory cortical processing of speech has been characterized in adults, but very little is known about the neural representation of speech sound perception in the developing brain. We used whole-head magnetoencephalography (MEG) to record neural responses to speech and nonspeech sounds in first-graders (7-8-year-old) and compared the activation sequence to that in adults. In children, the general location of neural activity in the superior temporal cortex was similar to that in adults, but in the time domain the sequence of activation was strikingly different. Cortical differentiation between sound types emerged in a prolonged response pattern at about 250 ms after sound onset, in both hemispheres, clearly later than the corresponding effect at about 100 ms in adults that was detected specifically in the left hemisphere. Better reading skills were linked with shorter-lasting neural activation, speaking for interdependence of the maturing neural processes of auditory perception and developing linguistic skills. This study uniquely utilized the potential of MEG in comparing both spatial and temporal characteristics of neural activation between adults and children. Besides depicting the group-typical features in cortical auditory processing, the results revealed marked interindividual variability in children.
Full Text Available In obstetrics postoperative cognitive dysfunctions may take place after caesarean section and vaginal delivery with poor results both for mother and child. The goal was to study influence of anesthesia techniques following caesarian section on memory, perception and speech. Having agreed with local ethics committee and obtained informed consent depending on anesthesia method, pregnant women were divided into 2 groups: 1st group (n=31 had spinal anesthesia, 2nd group (n=34 – total intravenous anesthesia. Spinal anesthesia: 1.8-2.2 mLs of hyperbaric 0.5% bupivacaine. ТIVА: Thiopental sodium (4 mgs kg-1, succinylcholine (1-1.5 mgs kg-1. Phentanyl (10-5-3 µgs kg-1 hour and Diazepam (10 mgs were used after newborn extraction. We used Luria’s test for memory assessment, perception was studied by test “recognition of time”. Speech was studied by test "name of fingers". Control points: 1 - before the surgery, 2 - in 24h after the caesarian section, 3 - on day 3 after surgery, 4 - at discharge from hospital (5-7th day. The study showed that initially decreased memory level in expectant mothers regressed along with the time after caesarean section. Memory is restored in 3 days after surgery regardless of anesthesia techniques. In spinal anesthesia on 5-7th postoperative day memory level exceeds that of used in total intravenous anesthesia. The perception and speech do not depend on the term of postoperative period. Anesthesia technique does not influence perception and speech restoration after caesarean sections.
Full Text Available Event-related potential (ERP evidence demonstrates that preschool-aged children selectively attend to informative moments such as word onsets during speech perception. Although this observation indicates a role for attention in language processing, it is unclear whether this type of attention is part of basic speech perception mechanisms, higher-level language skills, or general cognitive abilities. The current study examined these possibilities by measuring ERPs from 5-year-old children listening to a narrative containing attention probes presented before, during, and after word onsets as well as at random control times. Children also completed behavioral tests assessing verbal and nonverbal skills. Probes presented after word onsets elicited a more negative ERP response beginning around 100 ms after probe onset than control probes, indicating increased attention to word-initial segments. Crucially, the magnitude of this difference was correlated with performance on verbal tasks, but showed no relationship to nonverbal measures. More specifically, ERP attention effects were most strongly correlated with performance on a complex metalinguistic task involving grammaticality judgments. These results demonstrate that effective allocation of attention during speech perception supports higher-level, controlled language processing in children by allowing them to focus on relevant information at individual word and complex sentence levels.
Astheimer, Lori; Janus, Monika; Moreno, Sylvain; Bialystok, Ellen
Event-related potential (ERP) evidence demonstrates that preschool-aged children selectively attend to informative moments such as word onsets during speech perception. Although this observation indicates a role for attention in language processing, it is unclear whether this type of attention is part of basic speech perception mechanisms, higher-level language skills, or general cognitive abilities. The current study examined these possibilities by measuring ERPs from 5-year-old children listening to a narrative containing attention probes presented before, during, and after word onsets as well as at random control times. Children also completed behavioral tests assessing verbal and nonverbal skills. Probes presented after word onsets elicited a more negative ERP response beginning around 100 ms after probe onset than control probes, indicating increased attention to word-initial segments. Crucially, the magnitude of this difference was correlated with performance on verbal tasks, but showed no relationship to nonverbal measures. More specifically, ERP attention effects were most strongly correlated with performance on a complex metalinguistic task involving grammaticality judgments. These results demonstrate that effective allocation of attention during speech perception supports higher-level, controlled language processing in children by allowing them to focus on relevant information at individual word and complex sentence levels. PMID:24316548
The preservation of QoS for multimedia traffic through a data network is a difficult problem. We focus our attention on video frame rate and study its influence on speech perception. When sound and picture are discrepant (e.g., acoustic `ba' combined with visual `ga'), subjects perceive a different sound (such as `da'). This phenomenon is known as the McGurk effect. In this paper, the influence of degraded video frame rate on speech perception was studied. It was shown that when frame rate decreases, correct hearing is improved for discrepant stimuli and is degraded for congruent (voice and picture are the same) stimuli. Furthermore, we studied the case where lip closure was always captured by the synchronization of sampling time and lip position. In this case, frame rate has little effect on mishearing for congruent stimuli. For discrepant stimuli, mishearing is decreased with degraded frame rate. These results indicate that stiff motion of lips resulting from low frame rate cannot give enough labial information for speech perception. In addition, the effect of delaying the picture to correct for low frame rate was studied. The results, however, were not as definitive as expected because of compound effects related to the synchronization of sound and picture.
Full Text Available The present study attempts to investigate Indonesian EFL teachers' and native English speakers' perceptions of mispronunciations of English sounds by Indonesian EFL learners. For this purpose, a paper-form questionnaire consisting of 32 target mispronunciations was distributed to Indonesian secondary school teachers of English and also to native English speakers. An analysis of the respondents' perceptions has discovered that 14 out of the 32 target mispronunciations are pedagogically significant in pronunciation instruction. A further analysis of the reasons for these major mispronunciations has reconfirmed the prevalence of interference of learners' native language in their English pronunciation as a major cause of mispronunciations. It has also revealed Indonesian EFL teachers' tendency to overestimate the seriousness of their learners' pronunciations. Based on these findings, the study makes suggestions for better English pronunciation teaching in Indonesia or other EFL countries.
Temporal proximity is one of the key factors determining whether events in different modalities are integrated into a unified percept. Sensitivity to audiovisual temporal asynchrony has been studied in adults in great detail. However, how such sensitivity matures during childhood is poorly understood. We examined perception of audiovisual temporal…
Full Text Available The present study attempts to investigate Indonesian EFL teachersâ€™ and native English speakersâ€™ perceptions of mispronunciations of English sounds by Indonesian EFL learners. For this purpose, a paper-form questionnaire consisting of 32 target mispronunciations was distributed to Indonesian secondary school teachers of English and also to native English speakers. An analysis of the respondentsâ€™ perceptions has discovered that 14 out of the 32 target mispronunciations are pedagogically significant in pronunciation instruction. A further analysis of the reasons for these major mispronunciations has reconfirmed the prevalence of interference of learnersâ€™ native language in their English pronunciation as a major cause of mispronunciations. It has also revealed Indonesian EFL teachersâ€™ tendency to overestimate the seriousness of their learnersâ€™ pronunciations. Based on these findings, the study makes suggestions for better English pronunciation teaching in Indonesia or other EFL countries.
Keane, Brian P.; Rosenthal, Orna; Chun, Nicole H.; Shams, Ladan
Autism involves various perceptual benefits and deficits, but it is unclear if the disorder also involves anomalous audiovisual integration. To address this issue, we compared the performance of high-functioning adults with autism and matched controls on experiments investigating the audiovisual integration of speech, spatiotemporal relations, and…
Full Text Available The relationship between the neurobiology of speech and music has been investigated for more than a century. There remains no widespread agreement regarding how (or to what extent music perception utilizes the neural circuitry that is engaged in speech processing, particularly at the cortical level. Prominent models such as Patel’s Shared Syntactic Integration Resource Hypothesis (SSIRH and Koelsch’s neurocognitive model of music perception suggest a high degree of overlap, particularly in the frontal lobe, but also perhaps more distinct representations in the temporal lobe with hemispheric asymmetries. The present meta-analysis study used activation likelihood estimate analyses to identify the brain regions consistently activated for music as compared to speech across the functional neuroimaging (fMRI and PET literature. Eighty music and 91 speech neuroimaging studies of healthy adult control subjects were analyzed. Peak activations reported in the music and speech studies were divided into four paradigm categories: passive listening, discrimination tasks, error/anomaly detection tasks and memory-related tasks. We then compared activation likelihood estimates within each category for music versus speech, and each music condition with passive listening. We found that listening to music and to speech preferentially activate distinct temporo-parietal bilateral cortical networks. We also found music and speech to have shared resources in the left pars opercularis but speech-specific resources in the left pars triangularis. The extent to which music recruited speech-activated frontal resources was modulated by task. While there are certainly limitations to meta-analysis techniques particularly regarding sensitivity, this work suggests that the extent of shared resources between speech and music may be task-dependent and highlights the need to consider how task effects may be affecting conclusions regarding the neurobiology of speech and music.
Scott, Sophie K.; Rosen, Stuart; Wickham, Lindsay; Wise, Richard J. S.
Positron emission tomography (PET) was used to investigate the neural basis of the comprehension of speech in unmodulated noise (``energetic'' masking, dominated by effects at the auditory periphery), and when presented with another speaker (``informational'' masking, dominated by more central effects). Each type of signal was presented at four different signal-to-noise ratios (SNRs) (+3, 0, -3, -6 dB for the speech-in-speech, +6, +3, 0, -3 dB for the speech-in-noise), with listeners instructed to listen for meaning to the target speaker. Consistent with behavioral studies, there was SNR-dependent activation associated with the comprehension of speech in noise, with no SNR-dependent activity for the comprehension of speech-in-speech (at low or negative SNRs). There was, in addition, activation in bilateral superior temporal gyri which was associated with the informational masking condition. The extent to which this activation of classical ``speech'' areas of the temporal lobes might delineate the neural basis of the informational masking is considered, as is the relationship of these findings to the interfering effects of unattended speech and sound on more explicit working memory tasks. This study is a novel demonstration of candidate neural systems involved in the perception of speech in noisy environments, and of the processing of multiple speakers in the dorso-lateral temporal lobes.
In this dissertation I present a model that captures categorical effects in both first language (L1) and second language (L2) speech perception. In L1 perception, categorical effects range between extremely strong for consonants to nearly continuous perception of vowels. I treat the problem of speech perception as a statistical inference problem and by quantifying categoricity I obtain a unified model of both strong and weak categorical effects. In this optimal inference mechanism, the listener uses their knowledge of categories and the acoustics of the signal to infer the intended productions of the speaker. The model splits up speech variability into meaningful category variance and perceptual noise variance. The ratio of these two variances, which I call Tau, directly correlates with the degree of categorical effects for a given phoneme or continuum. By fitting the model to behavioral data from different phonemes, I show how a single parametric quantitative variation can lead to the different degrees of categorical effects seen in perception experiments with different phonemes. In L2 perception, L1 categories have been shown to exert an effect on how L2 sounds are identified and how well the listener is able to discriminate them. Various models have been developed to relate the state of L1 categories with both the initial and eventual ability to process the L2. These models largely lacked a formalized metric to measure perceptual distance, a means of making a-priori predictions of behavior for a new contrast, and a way of describing non-discrete gradient effects. In the second part of my dissertation, I apply the same computational model that I used to unify L1 categorical effects to examining L2 perception. I show that we can use the model to make the same type of predictions as other SLA models, but also provide a quantitative framework while formalizing all measures of similarity and bias. Further, I show how using this model to consider L2 learners at
Full Text Available Animated agents are becoming increasingly frequent in research and applications in speech science. An important challenge is to evaluate the effectiveness of the agent in terms of the intelligibility of its visible speech. In three experiments, we extend and test the Sumby and Pollack (1954 metric to allow the comparison of an agent relative to a standard or reference, and also propose a new metric based on the fuzzy logical model of perception (FLMP to describe the benefit provided by a synthetic animated face relative to the benefit provided by a natural face. A valid metric would allow direct comparisons accross different experiments and would give measures of the benfit of a synthetic animated face relative to a natural face (or indeed any two conditions and how this benefit varies as a function of the type of synthetic face, the test items (e.g., syllables versus sentences, different individuals, and applications.
Jepsen, Morten Løve
A better understanding of how the human auditory system represents and analyzes sounds and how hearing impairment affects such processing is of great interest for researchers in the fields of auditory neuroscience, audiology, and speech communication as well as for applications in hearing......-instrument and speech technology. In this thesis, the primary focus was on the development and evaluation of a computational model of human auditory signal-processing and perception. The model was initially designed to simulate the normal-hearing auditory system with particular focus on the nonlinear processing...... aimed at experimentally characterizing the effects of cochlear damage on listeners' auditory processing, in terms of sensitivity loss and reduced temporal and spectral resolution. The results showed that listeners with comparable audiograms can have very different estimated cochlear input...
Massaro, Dominic W
I review 2 seminal research reports published in this journal during its second decade more than a century ago. Given psychology's subdisciplines, they would not normally be reviewed together because one involves reading and the other speech perception. The small amount of interaction between these domains might have limited research and theoretical progress. In fact, the 2 early research reports revealed common processes involved in these 2 forms of language processing. Their illustration of the role of Wundt's apperceptive process in reading and speech perception anticipated descriptions of contemporary theories of pattern recognition, such as the fuzzy logical model of perception. Based on the commonalities between reading and listening, one can question why they have been viewed so differently. It is commonly believed that learning to read requires formal instruction and schooling, whereas spoken language is acquired from birth onward through natural interactions with people who talk. Most researchers and educators believe that spoken language is acquired naturally from birth onward and even prenatally. Learning to read, on the other hand, is not possible until the child has acquired spoken language, reaches school age, and receives formal instruction. If an appropriate form of written text is made available early in a child's life, however, the current hypothesis is that reading will also be learned inductively and emerge naturally, with no significant negative consequences. If this proposal is true, it should soon be possible to create an interactive system, Technology Assisted Reading Acquisition, to allow children to acquire literacy naturally.
Sereno, Joan; McCall, Joyce; Jongman, Allard; Dijkstra, Ton; van Heuven, Walter
The current study investigates the effect of phonetic inventory on perception of foreign-accented speech. The perception of native English speech was compared to the perception of foreign-accented English (Dutch-accented English), with selection of stimuli determined on the basis of phonetic inventory. Half of the stimuli contained phonemes that are unique to English and do not occur in Dutch (e.g., [θ] and [æ]), and the other half contained only phonemes that are similar in both English and Dutch (e.g., [s], [i]). Both word and nonword stimuli were included to investigate the role of lexical status. A native speaker of English and a native speaker of Dutch recorded all stimuli. Stimuli were then presented to 40 American listeners using a randomized blocked design in a lexical decision experiment. Results reveal an interaction between speaker (native English versus native Dutch) and phonetic inventory (unique versus common phonemes). Specifically, Dutch-accented stimuli with common phonemes were recognized faster and more accurately than Dutch-accented stimuli with unique phonemes. Results will be discussed in terms of the influence of foreign accent on word recognition processes.
Möttönen, Riikka; Sams, Mikko
Information about the objects and events in the external world is received via multiple sense organs, especially via eyes and ears. For example, a singing bird can be heard and seen. Typically, audiovisual objects are detected, localized and identified more rapidly and accurately than objects which are perceived via only one sensory system (see, e.g. Welch and Warren, 1986; Stein and Meredith, 1993; de Gelder and Bertelson, 2003; Calvert et al., 2004). The ability of the central nervous system to utilize sensory inputs mediated by different sense organs is called multisensory processing.
Callan, Daniel E; Jones, Jeffery A; Callan, Akiko
Behavioral and neuroimaging studies have demonstrated that brain regions involved with speech production also support speech perception, especially under degraded conditions. The premotor cortex (PMC) has been shown to be active during both observation and execution of action ("Mirror System" properties), and may facilitate speech perception by mapping unimodal and multimodal sensory features onto articulatory speech gestures. For this functional magnetic resonance imaging (fMRI) study, participants identified vowels produced by a speaker in audio-visual (saw the speaker's articulating face and heard her voice), visual only (only saw the speaker's articulating face), and audio only (only heard the speaker's voice) conditions with varying audio signal-to-noise ratios in order to determine the regions of the PMC involved with multisensory and modality specific processing of visual speech gestures. The task was designed so that identification could be made with a high level of accuracy from visual only stimuli to control for task difficulty and differences in intelligibility. The results of the functional magnetic resonance imaging (fMRI) analysis for visual only and audio-visual conditions showed overlapping activity in inferior frontal gyrus and PMC. The left ventral inferior premotor cortex (PMvi) showed properties of multimodal (audio-visual) enhancement with a degraded auditory signal. The left inferior parietal lobule and right cerebellum also showed these properties. The left ventral superior and dorsal premotor cortex (PMvs/PMd) did not show this multisensory enhancement effect, but there was greater activity for the visual only over audio-visual conditions in these areas. The results suggest that the inferior regions of the ventral premotor cortex are involved with integrating multisensory information, whereas, more superior and dorsal regions of the PMC are involved with mapping unimodal (in this case visual) sensory features of the speech signal with
Full Text Available Behavioral and neuroimaging studies have demonstrated that brain regions involved with speech production also support speech perception, especially under degraded conditions. The premotor cortex has been shown to be active during both observation and execution of action (‘Mirror System’ properties, and may facilitate speech perception by mapping unimodal and multimodal sensory features onto articulatory speech gestures. For this functional magnetic resonance imaging (fMRI study, participants identified vowels produced by a speaker in audio-visual (saw the speaker’s articulating face and heard her voice, visual only (only saw the speaker’s articulating face, and audio only (only heard the speaker’s voice conditions with varying audio signal-to-noise ratios in order to determine the regions of the premotor cortex involved with multisensory and modality specific processing of visual speech gestures. The task was designed so that identification could be made with a high level of accuracy from visual only stimuli to control for task difficulty and differences in intelligibility. The results of the fMRI analysis for visual only and audio-visual conditions showed overlapping activity in inferior frontal gyrus and premotor cortex. The left ventral inferior premotor cortex showed properties of multimodal (audio-visual enhancement with a degraded auditory signal. The left inferior parietal lobule and right cerebellum also showed these properties. The left ventral superior and dorsal premotor cortex did not show this multisensory enhancement effect, but there was greater activity for the visual only over audio-visual conditions in these areas. The results suggest that the inferior regions of the ventral premotor cortex are involved with integrating multisensory information, whereas, more superior and dorsal regions of the premotor cortex are involved with mapping unimodal (in this case visual sensory features of the speech signal with
Full Text Available Brain-computer interfaces (BCIs are systems that use real-time analysis of neuroimaging data to determine the mental state of their user for purposes such as providing neurofeedback. Here, we investigate the feasibility of a BCI based on speech perception. Multivariate pattern classification methods were applied to single-trial EEG data collected during speech perception by native and non-native speakers. Two principal questions were asked: 1 Can differences in the perceived categories of pairs of phonemes be decoded at the single-trial level? 2 Can these same categorical differences be decoded across participants, within or between native-language groups? Results indicated that classification performance progressively increased with respect to the categorical status (within, boundary or across of the stimulus contrast, and was also influenced by the native language of individual participants. Classifier performance showed strong relationships with traditional event-related potential measures and behavioral responses. The results of the cross-participant analysis indicated an overall increase in average classifier performance when trained on data from all participants (native and non-native. A second cross-participant classifier trained only on data from native speakers led to an overall improvement in performance for native speakers, but a reduction in performance for non-native speakers. We also found that the native language of a given participant could be decoded on the basis of EEG data with accuracy above 80%. These results indicate that electrophysiological responses underlying speech perception can be decoded at the single-trial level, and that decoding performance systematically reflects graded changes in the responses related to the phonological status of the stimuli. This approach could be used in extensions of the BCI paradigm to support perceptual learning during second language acquisition.
Heather Raye Dial
When the lexical and sublexical stimuli were matched in discriminability, scores were highly correlated and no individual demonstrated substantially better performance on lexical than sublexical perception (Figures 1a-c. However, when the word discriminations were easier (as in prior studies; e.g., Miceli et al., 1980, patients with impaired syllable discrimination were within the control range on word discrimination (Figure 1d. Finally, digit matching showed no significant relation to perception tasks (e.g., Figure 1e. Moreover, there was a wide range of digit matching spans for patients performing well on speech perception tasks (e.g., > 1.5 on syllable discrimination and digit matching ranging from 3.6 to 6.0. These data fail to support dual route claims, suggesting that lexical processing depends on sublexical perception and suggesting that phonological STM depends on a buffer separate from speech perception mechanisms.
The work presented in this book focuses on modeling audiovisual quality as perceived by the users of IP-based solutions for video communication like videotelephony. It also extends the current framework for the parametric prediction of audiovisual call quality. The book addresses several aspects related to the quality perception of entire video calls, namely, the quality estimation of the single audio and video modalities in an interactive context, the audiovisual quality integration of these modalities and the temporal pooling of short sample-based quality scores to account for the perceptual quality impact of time-varying degradations.
Jenson, David; Harkrider, Ashley W; Thornton, David; Bowers, Andrew L; Saltuklaroglu, Tim
Sensorimotor integration (SMI) across the dorsal stream enables online monitoring of speech. Jenson et al. (2014) used independent component analysis (ICA) and event related spectral perturbation (ERSP) analysis of electroencephalography (EEG) data to describe anterior sensorimotor (e.g., premotor cortex, PMC) activity during speech perception and production. The purpose of the current study was to identify and temporally map neural activity from posterior (i.e., auditory) regions of the dorsal stream in the same tasks. Perception tasks required "active" discrimination of syllable pairs (/ba/ and /da/) in quiet and noisy conditions. Production conditions required overt production of syllable pairs and nouns. ICA performed on concatenated raw 68 channel EEG data from all tasks identified bilateral "auditory" alpha (α) components in 15 of 29 participants localized to pSTG (left) and pMTG (right). ERSP analyses were performed to reveal fluctuations in the spectral power of the α rhythm clusters across time. Production conditions were characterized by significant α event related synchronization (ERS; pFDR covert replay, and that covert replay provides one source of the motor activity previously observed in some speech perception tasks. To our knowledge, this is the first time that inhibition of auditory regions by speech has been observed in real-time with the ICA/ERSP technique.
Gervain, Judit; Mehler, Jacques
During the first year of life, infants pass important milestones in language development. We review some of the experimental evidence concerning these milestones in the domains of speech perception, phonological development, word learning, morphosyntactic acquisition, and bilingualism, emphasizing their interactions. We discuss them in the context of their biological underpinnings, introducing the most recent advances not only in language development, but also in neighboring areas such as genetics and the comparative research on animal communication systems. We argue for a theory of language acquisition that integrates behavioral, cognitive, neural, and evolutionary considerations and proposes to unify previously opposing theoretical stances, such as statistical learning, rule-based nativist accounts, and perceptual learning theories.
Vroomen, Jean; Stekelenburg, Jeroen J
The neural activity of speech sound processing (the N1 component of the auditory ERP) can be suppressed if a speech sound is accompanied by concordant lip movements. Here we demonstrate that this audiovisual interaction is neither speech specific nor linked to humanlike actions but can be observed with artificial stimuli if their timing is made predictable. In Experiment 1, a pure tone synchronized with a deformation of a rectangle induced a smaller auditory N1 than auditory-only presentations if the temporal occurrence of this audiovisual event was made predictable by two moving disks that touched the rectangle. Local autoregressive average source estimation indicated that this audiovisual interaction may be related to integrative processing in auditory areas. When the moving disks did not precede the audiovisual stimulus--making the onset unpredictable--there was no N1 reduction. In Experiment 2, the predictability of the leading visual signal was manipulated by introducing a temporal asynchrony between the audiovisual event and the collision of moving disks. Audiovisual events occurred either at the moment, before (too "early"), or after (too "late") the disks collided on the rectangle. When asynchronies varied from trial to trial--rendering the moving disks unreliable temporal predictors of the audiovisual event--the N1 reduction was abolished. These results demonstrate that the N1 suppression is induced by visual information that both precedes and reliably predicts audiovisual onset, without a necessary link to human action-related neural mechanisms.
Han, Yi; Wang, Guoyin; Yang, Yong; He, Kun
Human emotions could be expressed by many bio-symbols. Speech and facial expression are two of them. They are both regarded as emotional information which is playing an important role in human-computer interaction. Based on our previous studies on emotion recognition, an audiovisual emotion recognition system is developed and represented in this paper. The system is designed for real-time practice, and is guaranteed by some integrated modules. These modules include speech enhancement for eliminating noises, rapid face detection for locating face from background image, example based shape learning for facial feature alignment, and optical flow based tracking algorithm for facial feature tracking. It is known that irrelevant features and high dimensionality of the data can hurt the performance of classifier. Rough set-based feature selection is a good method for dimension reduction. So 13 speech features out of 37 ones and 10 facial features out of 33 ones are selected to represent emotional information, and 52 audiovisual features are selected due to the synchronization when speech and video fused together. The experiment results have demonstrated that this system performs well in real-time practice and has high recognition rate. Our results also show that the work in multimodules fused recognition will become the trend of emotion recognition in the future.
This study investigated the knowledge and attitude of primary school teachers regarding the impact of poor classroom acoustics on learners' speech perception and learning in class. Classrooms with excessive background noise and reflective surfaces could be a barrier to learning, and it is important that teachers are aware of this. There is currently limited research data about teachers' knowledge regarding the topic of classroom acoustics. Seventy teachers from three Johannesburg primary schools participated in this study. A survey by way of structured self-administered questionnaire was the primary data collection method. The findings of this study showed that most of the participants in this study did not have adequate knowledge of classroom acoustics. Most of the participants were also unaware of the impact that classrooms with poor acoustic environments can have on speech perception and learning. These results are discussed in relation to the practical implication of empowering teachers to manage the acoustic environment of their classrooms, limitations of the study as well as implications for future research.
Wang, Hsiao-Lan S; Chen, I-Chen; Chiang, Chun-Han; Lai, Ying-Hui; Tsao, Yu
The current study examined the associations between basic auditory perception, speech prosodic processing, and vocabulary development in Chinese kindergartners, specifically, whether early basic auditory perception may be related to linguistic prosodic processing in Chinese Mandarin vocabulary acquisition. A series of language, auditory, and linguistic prosodic tests were given to 100 preschool children who had not yet learned how to read Chinese characters. The results suggested that lexical tone sensitivity and intonation production were significantly correlated with children's general vocabulary abilities. In particular, tone awareness was associated with comprehensive language development, whereas intonation production was associated with both comprehensive and expressive language development. Regression analyses revealed that tone sensitivity accounted for 36% of the unique variance in vocabulary development, whereas intonation production accounted for 6% of the variance in vocabulary development. Moreover, auditory frequency discrimination was significantly correlated with lexical tone sensitivity, syllable duration discrimination, and intonation production in Mandarin Chinese. Also it provided significant contributions to tone sensitivity and intonation production. Auditory frequency discrimination may indirectly affect early vocabulary development through Chinese speech prosody.
Trébuchon, Agnès; Démonet, Jean-François; Chauvel, Patrick; Liégeois-Chauvel, Catherine
Recent theory of physiology of language suggests a dual stream dorsal/ventral organization of speech perception. Using intra-cerebral Event-related potentials (ERPs) during pre-surgical assessment of twelve drug-resistant epileptic patients, we aimed to single out electrophysiological patterns during both lexical-semantic and phonological monitoring tasks involving ventral and dorsal regions respectively. Phonological information processing predominantly occurred in the left supra-marginal gyrus (dorsal stream) and lexico-semantic information occurred in anterior/middle temporal and fusiform gyri (ventral stream). Similar latencies were identified in response to phonological and lexico-semantic tasks, suggesting parallel processing. Typical ERP components were strongly left lateralized since no evoked responses were recorded in homologous right structures. Finally, ERP patterns suggested the inferior frontal gyrus as the likely final common pathway of both dorsal and ventral streams. These results brought out detailed evidence of the spatial-temporal information processing in the dual pathways involved in speech perception.
Full Text Available Spoken words are highly variable. A single word may never be uttered the same way twice. As listeners, we regularly encounter speakers of different ages, genders, and accents, increasing the amount of variation we face. How listeners understand spoken words as quickly and adeptly as they do despite this variation remains an issue central to linguistic theory. We propose that learned acoustic patterns are mapped simultaneously to linguistic representations and to social representations. In doing so, we illuminate a paradox that results in the literature from, we argue, the focus on representations and the peripheral treatment of word-level phonetic variation. We consider phonetic variation more fully and highlight a growing body of work that is problematic for current theory: Words with different pronunciation variants are recognized equally well in immediate processing tasks, while an atypical, infrequent, but socially-idealized form is remembered better in the long-term. We suggest that the perception of spoken words is socially-weighted, resulting in sparse, but high-resolution clusters of socially-idealized episodes that are robust in immediate processing and are more strongly encoded, predicting memory inequality. Our proposal includes a dual-route approach to speech perception in which listeners map acoustic patterns in speech to linguistic and social representations in tandem. This approach makes novel predictions about the extraction of information from the speech signal, and provides a framework with which we can ask new questions. We propose that language comprehension, broadly, results from the integration of both linguistic and social information.
Zaar, Johannes; Dau, Torsten
to the considered sources of variability using a measure of the perceptual distance between responses. The largest effect was found across different CVs. For stimuli of the same phonetic identity, the speechinduced variability across and within talkers and the acrosslistener variability were substantial...
Julio Montero Díaz
Full Text Available This article analyzes the possibilities of presenting an audiovisual history in a society in which audiovisual media has progressively gained greater protagonism. We analyze specific cases of films and historical documentaries and we assess the difficulties faced by historians to understand the keys of audiovisual language and by filmmakers to understand and incorporate history into their productions. We conclude that it would not be possible to disseminate history in the western world without audiovisual resources circulated through various types of screens (cinema, television, computer, mobile phone, video games.
Full Text Available Foreign-accented speech often presents a challenging listening condition. In addition to deviations from the target speech norms related to the inexperience of the nonnative speaker, listener characteristics may play a role in determining intelligibility levels. We have previously shown that an implicit visual bias for associating East Asian faces and foreignness predicts the listeners’ perceptual ability to process Korean-accented English audiovisual speech (Yi et al., 2013. Here, we examine the neural mechanism underlying the influence of listener bias to foreign faces on speech perception. In a functional magnetic resonance imaging (fMRI study, native English speakers listened to native- and Korean-accented English sentences, with or without faces. The participants’ Asian-foreign association was measured using an implicit association test (IAT, conducted outside the scanner. We found that foreign-accented speech evoked greater activity in the bilateral primary auditory cortices and the inferior frontal gyri, potentially reflecting greater computational demand. Higher IAT scores, indicating greater bias, were associated with increased BOLD response to foreign-accented speech with faces in the primary auditory cortex, the early node for spectrotemporal analysis. We conclude the following: (1 foreign-accented speech perception places greater demand on the neural systems underlying speech perception; (2 face of the talker can exaggerate the perceived foreignness of foreign-accented speech; (3 implicit Asian-foreign association is associated with decreased neural efficiency in early spectrotemporal processing.
Weir, Kristy A.
Speech pathology students readily identify the importance of a sound understanding of anatomical structures central to their intended profession. In contrast, they often do not recognize the relevance of a broader understanding of structure and function. This study aimed to explore students' perceptions of the relevance of anatomy to speech…
Most, Tova; Rothem, Hilla; Luntz, Michal
The researchers evaluated the contribution of cochlear implants (CIs) to speech perception by a sample of prelingually deaf individuals implanted after age 8 years. This group was compared with a group with profound hearing impairment (HA-P), and with a group with severe hearing impairment (HA-S), both of which used hearing aids. Words and…
Zhang, Juan; McBride-Chang, Catherine
A 4-stage developmental model, in which auditory sensitivity is fully mediated by speech perception at both the segmental and suprasegmental levels, which are further related to word reading through their associations with phonological awareness, rapid automatized naming, verbal short-term memory and morphological awareness, was tested with…
Scott, Sophie K.
Our understanding of the neurobiological basis for human speech production and perception has benefited from insights from psychology, neuropsychology and neurology. In this overview, I outline some of the ways that functional imaging has added to this knowledge and argue that, as a neuroanatomical tool, functional imaging has led to some…
Isabel Fernandes Silva
Full Text Available Over the last decades, audiovisual translation has gained increased significance in Translation Studies as well as an interdisciplinary subject within other fields (media, cinema studies etc. Although many articles have been published on communicative aspects of translation such as politeness, only recently have scholars taken an interest in the translation of compliments. This study will focus on both these areas from a multimodal and pragmatic perspective, emphasizing the links between these fields and how this multidisciplinary approach will evidence the polysemiotic nature of the translation process. In Audiovisual Translation both text and image are at play, therefore, the translation of speech produced by the characters may either omit (because it is provided by visualgestual signs or it may emphasize information. A selection was made of the compliments present in the film What Women Want, our focus being on subtitles which did not successfully convey the compliment expressed in the source text, as well as analyze the reasons for this, namely difference in register, Culture Specific Items and repetitions. These differences lead to a different portrayal/identity/perception of the main character in the English version (original soundtrack and subtitled versions in Portuguese and Italian.
Full Text Available Activity in premotor and sensorimotor cortices is found in speech production and some perception tasks. Yet, how sensorimotor integration supports these functions is unclear due to a lack of data examining the timing of activity from these regions. Beta (~20Hz and alpha (~10Hz spectral power within the EEG µ rhythm are considered indices of motor and somatosensory activity, respectively. In the current study, perception conditions required discrimination (same/different of syllables pairs (/ba/ and /da/ in quiet and noisy conditions. Production conditions required covert and overt syllable productions and overt word production. Independent component analysis was performed on EEG data obtained during these conditions to 1 identify clusters of µ components common to all conditions and 2 examine real-time event-related spectral perturbations (ERSP within alpha and beta bands. 17 and 15 out of 20 participants produced left and right µ-components, respectively, localized to precentral gyri. Discrimination conditions were characterized by significant (pFDR<.05 early alpha event-related synchronization (ERS prior to and during stimulus presentation and later alpha event-related desynchronization (ERD following stimulus offset. Beta ERD began early and gained strength across time. Differences were found between quiet and noisy discrimination conditions. Both overt syllable and word productions yielded similar alpha/beta ERD that began prior to production and was strongest during muscle activity. Findings during covert production were weaker than during overt production. One explanation for these findings is that µ-beta ERD indexes early predictive coding (e.g., internal modeling and/or overt and covert attentional / motor processes. µ-alpha ERS may index inhibitory input to the premotor cortex from sensory regions prior to and during discrimination, while µ-alpha ERD may index re-afferent sensory feedback during speech rehearsal and production.
Pottackal Mathai, Jijo; Mohammed, Hasheem
To investigate the effect of compression time settings and presentation levels on speech perception in noise for elderly individuals with hearing loss. To compare aided speech perception performance in these individuals with age-matched normal hearing subjects. Twenty (normal hearing) participants within the age range of 60-68 years and 20 (mild-to-moderate sensorineural hearing loss) in the age range of 60-70 years were randomly recruited for the study. In the former group, SNR-50 was determined using phonetically balanced sentences that were mixed with speech-shaped noise presented at the most comfortable level. In the SNHL group, aided SNR-50 was determined at three different presentation levels (40, 60, and 80 dB HL) after fitting binaural hearing aids that had different compression time settings (fast and slow). In the SNHL group, slow compression time settings showed significantly better SNR-50 compared to fast release time. In addition, the mean of SNR-50 in the SNHL group was comparable to normal hearing participants while using a slow release time. A hearing aid with slow compression time settings led to significantly better speech perception in noise, compared to that of a hearing aid that had fast compression time settings.
Full Text Available Past work has shown relationship between the ability to discriminate spectral patterns and measures of speech intelligibility. The purpose of this study was to investigate the ability of both children and young adults to discriminate static and dynamic spectral patterns, comparing performance between the two groups and evaluating within- group results in terms of relationship to speech-in-noise perception. Data were collected from normal-hearing children (age range: 5.4-12.8 years and young adults (mean age: 22.8 years on two spectral discrimination tasks and speech-in-noise perception. The first discrimination task, involving static spectral profiles, measured the ability to detect a change in the phase of a low-density sinusoidal spectral ripple of wideband noise. Using dynamic spectral patterns, the second task determined the signal-to-noise ratio needed to discriminate the temporal pattern of frequency fluctuation imposed by stochastic lowrate frequency modulation (FM. Children performed significantly poorer than young adults on both discrimination tasks. For children, a significant correlation between speech-in-noise perception and spectral- pattern discrimination was obtained only with the dynamic patterns of the FM condition, with partial correlation suggesting that factors related to the childrenâ€™s age mediated the relationship.
Gjaja, Marin N.
Neural networks for supervised and unsupervised learning are developed and applied to problems in remote sensing, continuous map learning, and speech perception. Adaptive Resonance Theory (ART) models are real-time neural networks for category learning, pattern recognition, and prediction. Unsupervised fuzzy ART networks synthesize fuzzy logic and neural networks, and supervised ARTMAP networks incorporate ART modules for prediction and classification. New ART and ARTMAP methods resulting from analyses of data structure, parameter specification, and category selection are developed. Architectural modifications providing flexibility for a variety of applications are also introduced and explored. A new methodology for automatic mapping from Landsat Thematic Mapper (TM) and terrain data, based on fuzzy ARTMAP, is developed. System capabilities are tested on a challenging remote sensing problem, prediction of vegetation classes in the Cleveland National Forest from spectral and terrain features. After training at the pixel level, performance is tested at the stand level, using sites not seen during training. Results are compared to those of maximum likelihood classifiers, back propagation neural networks, and K-nearest neighbor algorithms. Best performance is obtained using a hybrid system based on a convex combination of fuzzy ARTMAP and maximum likelihood predictions. This work forms the foundation for additional studies exploring fuzzy ARTMAP's capability to estimate class mixture composition for non-homogeneous sites. Exploratory simulations apply ARTMAP to the problem of learning continuous multidimensional mappings. A novel system architecture retains basic ARTMAP properties of incremental and fast learning in an on-line setting while adding components to solve this class of problems. The perceptual magnet effect is a language-specific phenomenon arising early in infant speech development that is characterized by a warping of speech sound perception. An
Hertrich, Ingo; Kirsten, Mareike; Tiemann, Sonja; Beck, Sigrid; Wühle, Anja; Ackermann, Hermann; Rolke, Bettina
Discourse structure enables us to generate expectations based upon linguistic material that has already been introduced. The present magnetoencephalography (MEG) study addresses auditory perception of test sentences in which discourse coherence was manipulated by using presuppositions (PSP) that either correspond or fail to correspond to items in preceding context sentences with respect to uniqueness and existence. Context violations yielded delayed auditory M50 and enhanced auditory M200 cross-correlation responses to syllable onsets within an analysis window of 1.5s following the PSP trigger words. Furthermore, discourse incoherence yielded suppression of spectral power within an expanded alpha band ranging from 6 to 16Hz. This effect showed a bimodal temporal distribution, being significant in an early time window of 0.0-0.5s following the PSP trigger and a late interval of 2.0-2.5s. These findings indicate anticipatory top-down mechanisms interacting with various aspects of bottom-up processing during speech perception.
Jane, Griselda; Tunjungsari, Harini
Parental involvement in a speech therapy has not been prioritized in most therapy centers in Indonesia. One of the therapy centers that has recognized the importance of parental involvement is Kailila Speech Therapy Center. In Kailila speech therapy center, parental involvement in children's speech therapy is an obligation that has been…
Full Text Available This study investigated whether age and/or differences in hearing sensitivity influence the perception of the emotion dimensions arousal (calm vs. aroused and valence (positive vs. negative attitude in conversational speech. To that end, this study specifically focused on the relationship between participants’ ratings of short affective utterances and the utterances’ acoustic parameters (pitch, intensity, and articulation rate known to be associated with the emotion dimensions arousal and valence. Stimuli consisted of short utterances taken from a corpus of conversational speech. In two rating tasks, younger and older adults either rated arousal or valence using a 5-point scale. Mean intensity was found to be the main cue participants used in the arousal task (i.e., higher mean intensity cueing higher levels of arousal while mean F0 was the main cue in the valence task (i.e., higher mean F0 being interpreted as more negative. Even though there were no overall age group differences in arousal or valence ratings, compared to younger adults, older adults responded less strongly to mean intensity differences cueing arousal and responded more strongly to differences in mean F0 cueing valence. Individual hearing sensitivity among the older adults did not modify the use of mean intensity as an arousal cue. However, individual hearing sensitivity generally affected valence ratings and modified the use of mean F0. We conclude that age differences in the interpretation of mean F0 as a cue for valence are likely due to age-related hearing loss, whereas age differences in rating arousal do not seem to be driven by hearing sensitivity differences between age groups (as measured by pure-tone audiometry.
Lassen, N A; Friberg, L
Specific types of brain activity as sensory perception auditory, somato-sensory or visual -or the performance of movements are accompanied by increases of blood flow and oxygen consumption in the cortical areas involved with performing the respective tasks. The activation patterns observed...... by measuring regional cerebral blood flow CBF after intracarotid Xenon-133 injection are reviewed with emphasis on tests involving auditory perception and speech, and approach allowing to visualize Wernicke and Broca's areas and their contralateral homologues in vivo. The completely atraumatic tomographic CBF...
Full Text Available Abstract Background How does the brain repair obliterated speech and cope with acoustically ambivalent situations? A widely discussed possibility is to use top-down information for solving the ambiguity problem. In the case of speech, this may lead to a match of bottom-up sensory input with lexical expectations resulting in resonant states which are reflected in the induced gamma-band activity (GBA. Methods In the present EEG study, we compared the subject's pre-attentive GBA responses to obliterated speech segments presented after a series of correct words. The words were a minimal pair in German and differed with respect to the degree of specificity of segmental phonological information. Results The induced GBA was larger when the expected lexical information was phonologically fully specified compared to the underspecified condition. Thus, the degree of specificity of phonological information in the mental lexicon correlates with the intensity of the matching process of bottom-up sensory input with lexical information. Conclusions These results together with those of a behavioural control experiment support the notion of multi-level mechanisms involved in the repair of deficient speech. The delineated alignment of pre-existing knowledge with sensory input is in accordance with recent ideas about the role of internal forward models in speech perception.
Ramirez, Joshua; Mann, Virginia
Both dyslexics and auditory neuropathy (AN) subjects show inferior consonant-vowel (CV) perception in noise, relative to controls. To better understand these impairments, natural acoustic speech stimuli that were masked in speech-shaped noise at various intensities were presented to dyslexic, AN, and control subjects either in isolation or accompanied by visual articulatory cues. AN subjects were expected to benefit from the pairing of visual articulatory cues and auditory CV stimuli, provided that their speech perception impairment reflects a relatively peripheral auditory disorder. Assuming that dyslexia reflects a general impairment of speech processing rather than a disorder of audition, dyslexics were not expected to similarly benefit from an introduction of visual articulatory cues. The results revealed an increased effect of noise masking on the perception of isolated acoustic stimuli by both dyslexic and AN subjects. More importantly, dyslexics showed less effective use of visual articulatory cues in identifying masked speech stimuli and lower visual baseline performance relative to AN subjects and controls. Last, a significant positive correlation was found between reading ability and the ameliorating effect of visual articulatory cues on speech perception in noise. These results suggest that some reading impairments may stem from a central deficit of speech processing.
Drullman, R.; Bronkhorst, A.W.
Speech intelligibility was investigated by varying the number of interfering talkers, level, and mean pitch differences between target and interfering speech, and the presence of tactile support. In a first experiment the speech-reception threshold (SRT) for sentences was measured for a male talker
Chan, Yu Man; Pianta, Michael J; McKendrick, Allison M
Perceived synchrony of visual and auditory signals can be altered by exposure to a stream of temporally offset stimulus pairs. Previous literature suggests that adapting to audiovisual temporal offsets is an important recalibration to correctly combine audiovisual stimuli into a single percept across a range of source distances. Healthy aging results in synchrony perception over a wider range of temporally offset visual and auditory signals, independent of age-related unisensory declines in vision and hearing sensitivities. However, the impact of aging on audiovisual recalibration is unknown. Audiovisual synchrony perception for sound-lead and sound-lag stimuli was measured for 15 younger (22-32 years old) and 15 older (64-74 years old) healthy adults using a method-of-constant-stimuli, after adapting to a stream of visual and auditory pairs. The adaptation pairs were either synchronous or asynchronous (sound-lag of 230 ms). The adaptation effect for each observer was computed as the shift in the mean of the individually fitted psychometric functions after adapting to asynchrony. Post-adaptation to synchrony, the younger and older observers had average window widths (±standard deviation) of 326 (±80) and 448 (±105) ms, respectively. There was no adaptation effect for sound-lead pairs. Both the younger and older observers, however, perceived more sound-lag pairs as synchronous. The magnitude of the adaptation effect in the older observers was not correlated with how often they saw the adapting sound-lag stimuli as asynchronous. Our finding demonstrates that audiovisual synchrony perception adapts less with advancing age.
Yu Man eChan
Full Text Available Perceived synchrony of visual and auditory signals can be altered by exposure to a stream of temporally offset stimulus pairs. Previous literature suggests that adapting to audiovisual temporal offsets is an important recalibration to correctly combine audiovisual stimuli into a single percept across a range of source distances. Healthy ageing results in synchrony perception over a wider range of temporally offset visual and auditory signals, independent of age-related unisensory declines in vision and hearing sensitivities. However the impact of ageing on audiovisual recalibration is unkonwn. Audiovisual synchrony perception for sound-lead and sound-lag stimuli was measured for fifteen younger (22-32 years old and fifteen older (64-74 years old healthy adults using a method-of-constant-stimuli, after adapting to a stream of visual and auditory pairs. The adaptation pairs were either synchronous or asynchronous (sound-lag of 230ms. The adaptation effect for each observer was computed as the shift in the mean of the individually fitted psychometric functions after adapting to asynchrony. Post adaptation to synchrony, the younger and older observers had average window widths (±standard deviation of 326 (±80 and 448 (±105 ms, respectively. There was no adaptation effect for sound-lead pairs. Both the younger and older observers however perceived more sound-lag pairs as synchronous. The magnitude of the adaptation effect in the older observers was not correlated with how often they saw the adapting sound-lag stimuli as asynchronous nor their synchrony window widths. Our finding demonstrates that audiovisual synchrony perception adapts less with advancing age.
Tan, Jye-Sheng; Yeh, Su-Ling
Meanings of masked complex scenes can be extracted without awareness; however, it remains unknown whether audiovisual integration occurs with an invisible complex visual scene. The authors examine whether a scenery soundtrack can facilitate unconscious processing of a subliminal visual scene. The continuous flash suppression paradigm was used to render a complex scene picture invisible, and the picture was paired with a semantically congruent or incongruent scenery soundtrack. Participants were asked to respond as quickly as possible if they detected any part of the scene. Release-from-suppression time was used as an index of unconscious processing of the complex scene, which was shorter in the audiovisual congruent condition than in the incongruent condition (Experiment 1). The possibility that participants adopted different detection criteria for the 2 conditions was excluded (Experiment 2). The audiovisual congruency effect did not occur for objects-only (Experiment 3) and background-only (Experiment 4) pictures, and it did not result from consciously mediated conceptual priming (Experiment 5). The congruency effect was replicated when catch trials without scene pictures were added to exclude participants with high false-alarm rates (Experiment 6). This is the first study demonstrating unconscious audiovisual integration with subliminal scene pictures, and it suggests expansions of scene-perception theories to include unconscious audiovisual integration.
Getzmann, Stephan; Wascher, Edmund
Speech understanding in the presence of concurring sound is a major challenge especially for older persons. In particular, conversational turn-takings usually result in switch costs, as indicated by declined speech perception after changes in the relevant target talker. Here, we investigated whether visual cues indicating the future position of a target talker may reduce the costs of switching in younger and older adults. We employed a speech perception task, in which sequences of short words were simultaneously presented by three talkers, and analysed behavioural measures and event-related potentials (ERPs). Informative cues resulted in increased performance after a spatial change in target talker compared to uninformative cues, not indicating the future target position. Especially the older participants benefited from knowing the future target position in advance, indicated by reduced response times after informative cues. The ERP analysis revealed an overall reduced N2, and a reduced P3b to changes in the target talker location in older participants, suggesting reduced inhibitory control and context updating. On the other hand, a pronounced frontal late positive complex (f-LPC) to the informative cues indicated increased allocation of attentional resources to changes in target talker in the older group, in line with the decline-compensation hypothesis. Thus, knowing where to listen has the potential to compensate for age-related decline in attentional switching in a highly variable cocktail-party environment.
采用启动范式,以汉语听者为被试,考察了非言语声音是否影响言语声音的知觉.实验1考察了纯音对辅音范畴连续体知觉的影响,结果发现纯音影响到辅音范畴连续体的知觉,表现出频谱对比效应.实验2考察了纯音和复合音对元音知觉的影响,结果发现与元音共振峰频率一致的纯音或复合音加快了元音的识别,表现出启动效应.两个实验一致发现非言语声音能够影响言语声音的知觉,表明言语声音知觉也需要一个前言语的频谱特征分析阶段,这与言语知觉听觉理论的观点一致.%A long-standing debate in the field of speech perception concerns whether specialized processing mechanisms are necessary to perceive speech sounds. The motor theory argues that speech perception is a special process and non-speech sounds don't affect the perception of speech sounds. The auditory theory suggests that speech perception can be understood in terms of general auditory process, which is shared with the perception of non-speech sounds. The findings from English subjects indicate that the processing of non-speech sounds affects the perception of speech sounds. Few studies have been administered in Chinese. The present study administered two experiments to examine whether the processing of non-speech sounds could affect the perception of speech segments in Chinese listeners. In experiment 1, speech sounds were a continuum of synthesized consonant category ranging from /ba/ to /da/. Non-speech sounds were two sine wave tones, with frequency equal to the onset frequency of F2 of/ba/ and /da/, respectively. Following the two tones, the /ba/-/da/ series were presented with a 50ms ISI. Undergraduate participants were asked to identify the speech sounds. The results found that non-speech tones influenced identification of speech targets: when the frequency of tone was equal to F2 onset frequency of /ba/, participants were more likely to identify consonant
Noel, Jean-Paul; De Niear, Matthew; Van der Burg, Erik; Wallace, Mark T
Multisensory interactions are well established to convey an array of perceptual and behavioral benefits. One of the key features of multisensory interactions is the temporal structure of the stimuli combined. In an effort to better characterize how temporal factors influence multisensory interactions across the lifespan, we examined audiovisual simultaneity judgment and the degree of rapid recalibration to paired audiovisual stimuli (Flash-Beep and Speech) in a sample of 220 participants ranging from 7 to 86 years of age. Results demonstrate a surprisingly protracted developmental time-course for both audiovisual simultaneity judgment and rapid recalibration, with neither reaching maturity until well into adolescence. Interestingly, correlational analyses revealed that audiovisual simultaneity judgments (i.e., the size of the audiovisual temporal window of simultaneity) and rapid recalibration significantly co-varied as a function of age. Together, our results represent the most complete description of age-related changes in audiovisual simultaneity judgments to date, as well as being the first to describe changes in the degree of rapid recalibration as a function of age. We propose that the developmental time-course of rapid recalibration scaffolds the maturation of more durable audiovisual temporal representations.
De Niear, Matthew; Van der Burg, Erik; Wallace, Mark T.
Multisensory interactions are well established to convey an array of perceptual and behavioral benefits. One of the key features of multisensory interactions is the temporal structure of the stimuli combined. In an effort to better characterize how temporal factors influence multisensory interactions across the lifespan, we examined audiovisual simultaneity judgment and the degree of rapid recalibration to paired audiovisual stimuli (Flash-Beep and Speech) in a sample of 220 participants ranging from 7 to 86 years of age. Results demonstrate a surprisingly protracted developmental time-course for both audiovisual simultaneity judgment and rapid recalibration, with neither reaching maturity until well into adolescence. Interestingly, correlational analyses revealed that audiovisual simultaneity judgments (i.e., the size of the audiovisual temporal window of simultaneity) and rapid recalibration significantly co-varied as a function of age. Together, our results represent the most complete description of age-related changes in audiovisual simultaneity judgments to date, as well as being the first to describe changes in the degree of rapid recalibration as a function of age. We propose that the developmental time-course of rapid recalibration scaffolds the maturation of more durable audiovisual temporal representations. PMID:27551918
Baese-Berk, Melissa M; Dilley, Laura C; Schmidt, Stephanie; Morrill, Tuuli H; Pitt, Mark A
Neil Armstrong insisted that his quote upon landing on the moon was misheard, and that he had said one small step for a man, instead of one small step for man. What he said is unclear in part because function words like a can be reduced and spectrally indistinguishable from the preceding context. Therefore, their presence can be ambiguous, and they may disappear perceptually depending on the rate of surrounding speech. Two experiments are presented examining production and perception of reduced tokens of for and for a in spontaneous speech. Experiment 1 investigates the distributions of several acoustic features of for and for a. The results suggest that the distributions of for and for a overlap substantially, both in terms of temporal and spectral characteristics. Experiment 2 examines perception of these same tokens when the context speaking rate differs. The perceptibility of the function word a varies as a function of this context speaking rate. These results demonstrate that substantial ambiguity exists in the original quote from Armstrong, and that this ambiguity may be understood through context speaking rate.
Burnham, Denis; Dodd, Barbara
The McGurk effect, in which auditory [ba] dubbed onto [ga] lip movements is perceived as "da" or "tha," was employed in a real-time task to investigate auditory-visual speech perception in prelingual infants. Experiments 1A and 1B established the validity of real-time dubbing for producing the effect. In Experiment 2, 4 1/2-month-olds were tested in a habituation-test paradigm, in which an auditory-visual stimulus was presented contingent upon visual fixation of a live face. The experimental group was habituated to a McGurk stimulus (auditory [ba] visual [ga]), and the control group to matching auditory-visual [ba]. Each group was then presented with three auditory-only test trials, [ba], [da], and [(delta)a] (as in then). Visual-fixation durations in test trials showed that the experimental group treated the emergent percept in the McGurk effect, [da] or [(delta)a], as familiar (even though they had not heard these sounds previously) and [ba] as novel. For control group infants [da] and [(delta)a] were no more familiar than [ba]. These results are consistent with infants' perception of the McGurk effect, and support the conclusion that prelinguistic infants integrate auditory and visual speech information.
Baese-Berk, Melissa M.; Dilley, Laura C.; Schmidt, Stephanie; Morrill, Tuuli H.; Pitt, Mark A.
Neil Armstrong insisted that his quote upon landing on the moon was misheard, and that he had said one small step for a man, instead of one small step for man. What he said is unclear in part because function words like a can be reduced and spectrally indistinguishable from the preceding context. Therefore, their presence can be ambiguous, and they may disappear perceptually depending on the rate of surrounding speech. Two experiments are presented examining production and perception of reduced tokens of for and for a in spontaneous speech. Experiment 1 investigates the distributions of several acoustic features of for and for a. The results suggest that the distributions of for and for a overlap substantially, both in terms of temporal and spectral characteristics. Experiment 2 examines perception of these same tokens when the context speaking rate differs. The perceptibility of the function word a varies as a function of this context speaking rate. These results demonstrate that substantial ambiguity exists in the original quote from Armstrong, and that this ambiguity may be understood through context speaking rate. PMID:27603209
Full Text Available Background: Upon graduation, newly qualified speech-language therapists are expected to provide services independently. This study describes new graduates’ perceptions of their preparedness to provide services across the scope of the profession and explores associations between perceptions of dysphagia theory and clinical learning curricula with preparedness for adult and paediatric dysphagia service delivery.Methods: New graduates of six South African universities were recruited to participate in a survey by completing an electronic questionnaire exploring their perceptions of the dysphagia curricula and their preparedness to practise across the scope of the profession of speechlanguage therapy. Results: Eighty graduates participated in the study yielding a response rate of 63.49%. Participants perceived themselves to be well prepared in some areas (e.g. child language: 100%; articulation and phonology: 97.26%, but less prepared in other areas (e.g. adult dysphagia: 50.70%; paediatric dysarthria: 46.58%; paediatric dysphagia: 38.36% and most unprepared to provide services requiring sign language (23.61% and African languages (20.55%. There was a significant relationship between perceptions of adequate theory and clinical learning opportunities with assessment and management of dysphagia and perceptions of preparedness to provide dysphagia services. Conclusion: There is a need for review of existing curricula and consideration of developing a standard speech-language therapy curriculum across universities, particularly in service provision to a multilingual population, and in both the theory and clinical learning of the assessment and management of adult and paediatric dysphagia, to better equip graduates for practice.
Today, huge quantities of digital audiovisual resources are already available - everywhere and at any time - through Web portals, online archives and libraries, and video blogs. One central question with respect to this huge amount of audiovisual data is how they can be used in specific (social, pedagogical, etc.) contexts and what are their potential interest for target groups (communities, professionals, students, researchers, etc.).This book examines the question of the (creative) exploitation of digital audiovisual archives from a theoretical, methodological, technical and practical
Rosner, Burton S.; Talcott, Joel B.; Witton, Caroline; Hogg, James D.; Richardson, Alexandra J.; Hansen, Peter C.; Stein, John F.
"Sine-wave speech" sentences contain only four frequency-modulated sine waves, lacking many acoustic cues present in natural speech. Adults with (n=19) and without (n=14) dyslexia were asked to reproduce orally sine-wave utterances in successive trials. Results suggest comprehension of sine-wave sentences is impaired in some adults with…
Ohms, Verena Regina
Birdsong and human speech are both complex behaviours which show striking similarities mainly thought to be present in the area of development and learning. The most important parameters in human speech are vocal tract resonances, called formants. Different formant patterns characterize different vo
Mitterer, H.; McQueen, J.M.
Understanding foreign speech is difficult, in part because of unusual mappings between sounds and words. It is known that listeners in their native language can use lexical knowledge (about how words ought to sound) to learn how to interpret unusual speech-sounds. We therefore investigated whether s
Kennedy, Sara; Blanchet, Josée
To be effective second or additional language (L2) listeners, learners should be aware of typical processes in connected L2 speech (e.g. linking). This longitudinal study explored how learners' developing ability to perceive connected L2 speech was related to the quality of their language awareness. Thirty-two learners of L2 French at a university…
Saku T. Sinkkonen
Full Text Available The temporal difference limen (TDL can be measured with noninvasive electrical ear canal stimulation. The objective of the study wa to determine the role of preoperative TDL measurements in predicting patientsâ€™ speech perception after cochlear implantation. We carried out a retrospective chart analysis of fifty-four cochlear implant (CI patients with preoperative TDL and postoperative bisyllabic word recognition measurements in Helsinki University Central Hospital between March 1994 and March 2011. Our results show that there is no correlation between TDL and postoperative speech perception. However, patientâ€™s advancing age correlates with longer TDL but notdirectly with poorer speech perception. The results are in line with previous results concerning the lack of predictive value of preoperativ TDL measurements in CI patients.
Wang, Xiaoyue; Wang, Suiping; Fan, Yuebo; Huang, Dan; Zhang, Yang
Recent studies reveal that tonal language speakers with autism have enhanced neural sensitivity to pitch changes in nonspeech stimuli but not to lexical tone contrasts in their native language. The present ERP study investigated whether the distinct pitch processing pattern for speech and nonspeech stimuli in autism was due to a speech-specific deficit in categorical perception of lexical tones. A passive oddball paradigm was adopted to examine two groups (16 in the autism group and 15 in the control group) of Chinese children’s Mismatch Responses (MMRs) to equivalent pitch deviations representing within-category and between-category differences in speech and nonspeech contexts. To further examine group-level differences in the MMRs to categorical perception of speech/nonspeech stimuli or lack thereof, neural oscillatory activities at the single trial level were further calculated with the inter-trial phase coherence (ITPC) measure for the theta and beta frequency bands. The MMR and ITPC data from the children with autism showed evidence for lack of categorical perception in the lexical tone condition. In view of the important role of lexical tones in acquiring a tonal language, the results point to the necessity of early intervention for the individuals with autism who show such a speech-specific categorical perception deficit. PMID:28225070
Kislova, O O; Rusalova, M N
The article is a review of the general concepts and approaches in research of recognition of emotions in speech: psychological concepts, principles and methods of study and physiological data in studies on animals and human. The concepts of emotional intelligence (ability to understand and recognize emotions of other people and to understand and regulate personal emotions), emotional hearing (ability to recognize emotions in speech) are discussed, general review of the paradigms is presented. The research of brain mechanisms of speech emotions differentiation is based on the study of local injuries and dysfunctions, along with the study on healthy subjects.
Kim, Jongin; Lee, Suh-Kyung; Lee, Boreom
Objective. The objective of this study is to find components that might be related to phoneme representation in the brain and to discriminate EEG responses for each speech sound on a trial basis. Approach. We used multivariate empirical mode decomposition (MEMD) and common spatial pattern for feature extraction. We chose three vowel stimuli, /a/, /i/ and /u/, based on previous findings, such that the brain can detect change in formant frequency (F2) of vowels. EEG activity was recorded from seven native Korean speakers at Gwangju Institute of Science and Technology. We applied MEMD over EEG channels to extract speech-related brain signal sources, and looked for the intrinsic mode functions which were dominant in the alpha bands. After the MEMD procedure, we applied the common spatial pattern algorithm for enhancing the classification performance, and used linear discriminant analysis (LDA) as a classifier. Main results. The brain responses to the three vowels could be classified as one of the learned phonemes on a single-trial basis with our approach. Significance. The results of our study show that brain responses to vowels can be classified for single trials using MEMD and LDA. This approach may not only become a useful tool for the brain-computer interface but it could also be used for discriminating the neural correlates of categorical speech perception.
Slater, Jessica; Skoe, Erika; Strait, Dana L; O'Connell, Samantha; Thompson, Elaine; Kraus, Nina
Music training may strengthen auditory skills that help children not only in musical performance but in everyday communication. Comparisons of musicians and non-musicians across the lifespan have provided some evidence for a "musician advantage" in understanding speech in noise, although reports have been mixed. Controlled longitudinal studies are essential to disentangle effects of training from pre-existing differences, and to determine how much music training is necessary to confer benefits. We followed a cohort of elementary school children for 2 years, assessing their ability to perceive speech in noise before and after musical training. After the initial assessment, participants were randomly assigned to one of two groups: one group began music training right away and completed 2 years of training, while the second group waited a year and then received 1 year of music training. Outcomes provide the first longitudinal evidence that speech-in-noise perception improves after 2 years of group music training. The children were enrolled in an established and successful community-based music program and followed the standard curriculum, therefore these findings provide an important link between laboratory-based research and real-world assessment of the impact of music training on everyday communication skills.
Conboy, Barbara T; Kuhl, Patricia K
Language experience 'narrows' speech perception by the end of infants' first year, reducing discrimination of non-native phoneme contrasts while improving native-contrast discrimination. Previous research showed that declines in non-native discrimination were reversed by second-language experience provided at 9-10 months, but it is not known whether second-language experience affects first-language speech sound processing. Using event-related potentials (ERPs), we examined learning-related changes in brain activity to Spanish and English phoneme contrasts in monolingual English-learning infants pre- and post-exposure to Spanish from 9.5-10.5 months of age. Infants showed a significant discriminatory ERP response to the Spanish contrast at 11 months (post-exposure), but not at 9 months (pre-exposure). The English contrast elicited an earlier discriminatory response at 11 months than at 9 months, suggesting improvement in native-language processing. The results show that infants rapidly encode new phonetic information, and that improvement in native speech processing can occur during second-language learning in infancy.
Researches on the development of learners＇ phonological competence have been done mainly from the aspects of physical prosperities of phonology and interlanguage of L2 acquisition, ignoring the effect of speech perception and production on it. Based on theories of cognition and psychology, this paper attempts to explore the prosperities and pattern of L2 oral reading speech perception. It indicates that learner is the subject of L2 oral reading speech perception, which is constrained by speech organs, cognitive ability and pattern of L1 Speech perception. In Addiction, there exist differences between L1 and I2 phonology perception, psychology perception and concept perception. L2 oral reading is essentially a physical and cognitive experience, the construction basis for the empirically cognitive teaching model.%国内已有的二语朗读研究主要从音系的物理特性和二语习得中介语的角度来探讨学习者音系发展水平，却忽略了言语感知和输出对二语朗读发展水平的作用。研究表明，学习者是二语朗读的主体，二语朗读受到发青器官、认知水平和母语感知方式的制约；二语朗读在语音感知、情感感知和概念感知方面与母语者存在差别。二语朗读的本质是生理和认知的体验性，这一特性正是二语朗读听读说叠加教学模式构建的基础。
Svirsky, Mario A; Teoh, Su-Wooi; Neuburger, Heidi
Like any other surgery requiring anesthesia, cochlear implantation in the first few years of life carries potential risks, which makes it important to assess the potential benefits. This study introduces a new method to assess the effect of age at implantation on cochlear implant outcomes: developmental trajectory analysis (DTA). DTA compares curves representing change in an outcome measure over time (i.e. developmental trajectories) for two groups of children that differ along a potentially important independent variable (e.g. age at intervention). This method was used to compare language development and speech perception outcomes in children who received cochlear implants in the second, third or fourth year of life. Within this range of age at implantation, it was found that implantation before the age of 2 resulted in speech perception and language advantages that were significant both from a statistical and a practical point of view. Additionally, the present results are consistent with the existence of a 'sensitive period' for language development, a gradual decline in language acquisition skills as a function of age.
Mitterer, Holger; McQueen, James M
Understanding foreign speech is difficult, in part because of unusual mappings between sounds and words. It is known that listeners in their native language can use lexical knowledge (about how words ought to sound) to learn how to interpret unusual speech-sounds. We therefore investigated whether subtitles, which provide lexical information, support perceptual learning about foreign speech. Dutch participants, unfamiliar with Scottish and Australian regional accents of English, watched Scottish or Australian English videos with Dutch, English or no subtitles, and then repeated audio fragments of both accents. Repetition of novel fragments was worse after Dutch-subtitle exposure but better after English-subtitle exposure. Native-language subtitles appear to create lexical interference, but foreign-language subtitles assist speech learning by indicating which words (and hence sounds) are being spoken.
Full Text Available A number of studies showed that infants reorganize their perception of speech sounds according to their native language categories during their first year of life. Still, information is lacking about the contribution of basic auditory mechanisms to this process. This study aimed to evaluate when native language experience starts to noticeably affect the perceptual processing of basic acoustic cues (i.e., frequency-modulation (FM and amplitude-modulation (AM information known to be crucial for speech perception in adults. The discrimination of a lexical-tone contrast (rising versus low was assessed in 6- and 10-month-old infants learning either French or Mandarin using a visual habituation paradigm. The lexical tones were presented in two conditions designed to either keep intact or to severely degrade the FM and fine spectral cues needed to accurately perceive voice-pitch trajectory. A third condition was designed to assess the discrimination of the same voice-pitch trajectories using click trains containing only the FM cues related to the fundamental-frequency (F0 in French- and Mandarin-learning 10-month-old infants. Results showed that the younger infants of both language groups and the Mandarin-learning 10-month-olds discriminated the intact lexical-tone contrast while French-learning 10-month-olds failed. However, only the French 10-month-olds discriminated degraded lexical tones when FM, and thus voice-pitch cues were reduced. Moreover, Mandarin-learning 10-month-olds were found to discriminate the pitch trajectories as presented in click trains better than French infants. Altogether, these results reveal that the perceptual reorganization occurring during the first year of life for lexical tones is coupled with changes in the auditory ability to use speech modulation cues.
Full Text Available The processing of fluent speech involves complex computational steps that begin with the segmentation of the continuous flow of speech sounds into syllables and words. One question that naturally arises pertains to the type of syllabic information that speech processes act upon. Here, we used functional magnetic resonance imaging to profile regions, using a combination of whole-brain and exploratory anatomical region-of-interest (ROI approaches, that were sensitive to syllabic information during speech perception by parametrically manipulating syllabic complexity along two dimensions: (1 individual syllable complexity, and (2 sequence complexity (supra-syllabic. We manipulated the complexity of the syllable by using the simplest syllable template—a consonant and vowel (CV-and inserting an additional consonant to create a complex onset (CCV. The supra-syllabic complexity was manipulated by creating sequences composed of the same syllable repeated 6 times (e.g. /pa-pa-pa-pa-pa-pa/ and sequences of 3 different syllables each repeated twice (e.g. /pa-ta-ka-pa-ta-ka/. This parametrical design allowed us to identify brain regions sensitive to (1 syllabic complexity independent of supra-syllabic complexity, (2 supra-syllabic complexity independent of syllabic complexity and, (3 both syllabic and supra-syllabic complexity. High-resolution scans were acquired for 15 healthy adults. An exploratory anatomical ROI analysis of the supratemporal plane (STP identified bilateral regions within the anterior two-third of the planum temporale, the primary auditory cortices as well as the anterior two-third of the superior temporal gyrus that showed different patterns of sensitivity to syllabic and supra-syllabic information. These findings demonstrate that during passive listening of syllable sequences, sublexical information is processed automatically, and sensitivity to syllabic and supra-syllabic information is localized almost exclusively within the STP.
Deschamps, Isabelle; Tremblay, Pascale
The processing of fluent speech involves complex computational steps that begin with the segmentation of the continuous flow of speech sounds into syllables and words. One question that naturally arises pertains to the type of syllabic information that speech processes act upon. Here, we used functional magnetic resonance imaging to profile regions, using a combination of whole-brain and exploratory anatomical region-of-interest (ROI) approaches, that were sensitive to syllabic information during speech perception by parametrically manipulating syllabic complexity along two dimensions: (1) individual syllable complexity, and (2) sequence complexity (supra-syllabic). We manipulated the complexity of the syllable by using the simplest syllable template-a consonant and vowel (CV)-and inserting an additional consonant to create a complex onset (CCV). The supra-syllabic complexity was manipulated by creating sequences composed of the same syllable repeated six times (e.g., /pa-pa-pa-pa-pa-pa/) and sequences of three different syllables each repeated twice (e.g., /pa-ta-ka-pa-ta-ka/). This parametrical design allowed us to identify brain regions sensitive to (1) syllabic complexity independent of supra-syllabic complexity, (2) supra-syllabic complexity independent of syllabic complexity and, (3) both syllabic and supra-syllabic complexity. High-resolution scans were acquired for 15 healthy adults. An exploratory anatomical ROI analysis of the supratemporal plane (STP) identified bilateral regions within the anterior two-third of the planum temporale, the primary auditory cortices as well as the anterior two-third of the superior temporal gyrus that showed different patterns of sensitivity to syllabic and supra-syllabic information. These findings demonstrate that during passive listening of syllable sequences, sublexical information is processed automatically, and sensitivity to syllabic and supra-syllabic information is localized almost exclusively within the STP.
Roberts, Brian; Summers, Robert J; Bailey, Peter J
How speech is separated perceptually from other speech remains poorly understood. In a series of experiments, perceptual organisation was probed by presenting three-formant (F1+F2+F3) analogues of target sentences dichotically, together with a competitor for F2 (F2C), or for F2+F3, which listeners must reject to optimise recognition. To control for energetic masking, the competitor was always presented in the opposite ear to the corresponding target formant(s). Sine-wave speech was used initially, and different versions of F2C were derived from F2 using separate manipulations of its amplitude and frequency contours. F2Cs with time-varying frequency contours were highly effective competitors, whatever their amplitude characteristics, whereas constant-frequency F2Cs were ineffective. Subsequent studies used synthetic-formant speech to explore the effects of manipulating the rate and depth of formant-frequency change in the competitor. Competitor efficacy was not tuned to the rate of formant-frequency variation in the target sentences; rather, the reduction in intelligibility increased with competitor rate relative to the rate for the target sentences. Therefore, differences in speech rate may not be a useful cue for separating the speech of concurrent talkers. Effects of competitors whose depth of formant-frequency variation was scaled by a range of factors were explored using competitors derived either by inverting the frequency contour of F2 about its geometric mean (plausibly speech-like pattern) or by using a regular and arbitrary frequency contour (triangle wave, not plausibly speech-like) matched to the average rate and depth of variation for the inverted F2C. Competitor efficacy depended on the overall depth of frequency variation, not depth relative to that for the other formants. Furthermore, the triangle-wave competitors were as effective as their more speech-like counterparts. Overall, the results suggest that formant-frequency variation is critical for
Tse, Chun-Yu; Gratton, Gabriele; Garnsey, Susan M; Novak, Michael A; Fabiani, Monica
Information from different modalities is initially processed in different brain areas, yet real-world perception often requires the integration of multisensory signals into a single percept. An example is the McGurk effect, in which people viewing a speaker whose lip movements do not match the utterance perceive the spoken sounds incorrectly, hearing them as more similar to those signaled by the visual rather than the auditory input. This indicates that audiovisual integration is important for generating the phoneme percept. Here we asked when and where the audiovisual integration process occurs, providing spatial and temporal boundaries for the processes generating phoneme perception. Specifically, we wanted to separate audiovisual integration from other processes, such as simple deviance detection. Building on previous work employing ERPs, we used an oddball paradigm in which task-irrelevant audiovisually deviant stimuli were embedded in strings of non-deviant stimuli. We also recorded the event-related optical signal, an imaging method combining spatial and temporal resolution, to investigate the time course and neuroanatomical substrate of audiovisual integration. We found that audiovisual deviants elicit a short duration response in the middle/superior temporal gyrus, whereas audiovisual integration elicits a more extended response involving also inferior frontal and occipital regions. Interactions between audiovisual integration and deviance detection processes were observed in the posterior/superior temporal gyrus. These data suggest that dynamic interactions between inferior frontal cortex and sensory regions play a significant role in multimodal integration.
Akahane-Yamada, Reiko; Komaki, Ryo; Kubo, Rieko
Komaki and Akahane-Yamada (Proc. ICA2004) used 2AFC translation task in vocabulary training, in which the target word is presented visually in orthographic form of one language, and the appropriate meaning in another language has to be chosen between two choices. Present paper examined the effect of audio-visual presentation of target word when native speakers of Japanese learn to translate English words into Japanese. Pairs of English words contrasted in several phonemic distinctions (e.g., /r/-/l/, /b/-/v/, etc.) were used as word materials, and presented in three conditions; visual-only (V), audio-only (A), and audio-visual (AV) presentations. Identification accuracy of those words produced by two talkers was also assessed. During pretest, the accuracy for A stimuli was lowest, implying that insufficient translation ability and listening ability interact with each other when aurally presented word has to be translated. However, there was no difference in accuracy between V and AV stimuli, suggesting that participants translate the words depending on visual information only. The effect of translation training using AV stimuli did not transfer to identification ability, showing that additional audio information during translation does not help improve speech perception. Further examination is necessary to determine the effective L2 training method. [Work supported by TAO, Japan.
Akahane-Yamada, Reiko; Komaki, Ryo; Kubo, Rieko
Komaki and Akahane-Yamada (Proc. ICA2004) used 2AFC translation task in vocabulary training, in which the target word is presented visually in orthographic form of one language, and the appropriate meaning in another language has to be chosen between two choices. Present paper examined the effect of audio-visual presentation of target word when native speakers of Japanese learn to translate English words into Japanese. Pairs of English words contrasted in several phonemic distinctions (e.g., /r/-/l/, /b/-/v/, etc.) were used as word materials, and presented in three conditions; visual-only (V), audio-only (A), and audio-visual (AV) presentations. Identification accuracy of those words produced by two talkers was also assessed. During pretest, the accuracy for A stimuli was lowest, implying that insufficient translation ability and listening ability interact with each other when aurally presented word has to be translated. However, there was no difference in accuracy between V and AV stimuli, suggesting that participants translate the words depending on visual information only. The effect of translation training using AV stimuli did not transfer to identification ability, showing that additional audio information during translation does not help improve speech perception. Further examination is necessary to determine the effective L2 training method. [Work supported by TAO, Japan.
Chuen, Lorraine; Schutz, Michael
An observer's inference that multimodal signals originate from a common underlying source facilitates cross-modal binding. This 'unity assumption' causes asynchronous auditory and visual speech streams to seem simultaneous (Vatakis & Spence, Perception & Psychophysics, 69(5), 744-756, 2007). Subsequent tests of non-speech stimuli such as musical and impact events found no evidence for the unity assumption, suggesting the effect is speech-specific (Vatakis & Spence, Acta Psychologica, 127(1), 12-23, 2008). However, the role of amplitude envelope (the changes in energy of a sound over time) was not previously appreciated within this paradigm. Here, we explore whether previous findings suggesting speech-specificity of the unity assumption were confounded by similarities in the amplitude envelopes of the contrasted auditory stimuli. Experiment 1 used natural events with clearly differentiated envelopes: single notes played on either a cello (bowing motion) or marimba (striking motion). Participants performed an un-speeded temporal order judgments task; viewing audio-visually matched (e.g., marimba auditory with marimba video) and mismatched (e.g., cello auditory with marimba video) versions of stimuli at various stimulus onset asynchronies, and were required to indicate which modality was presented first. As predicted, participants were less sensitive to temporal order in matched conditions, demonstrating that the unity assumption can facilitate the perception of synchrony outside of speech stimuli. Results from Experiments 2 and 3 revealed that when spectral information was removed from the original auditory stimuli, amplitude envelope alone could not facilitate the influence of audiovisual unity. We propose that both amplitude envelope and spectral acoustic cues affect the percept of audiovisual unity, working in concert to help an observer determine when to integrate across modalities.
Kwon, Jinhwan; Ogawa, Ken-ichiro; Miyake, Yoshihiro
Visual motion information from dynamic environments is important in multisensory temporal perception. However, it is unclear how visual motion information influences the integration of multisensory temporal perceptions. We investigated whether visual apparent motion affects audiovisual temporal perception. Visual apparent motion is a phenomenon in which two flashes presented in sequence in different positions are perceived as continuous motion. Across three experiments, participants performed temporal order judgment (TOJ) tasks. Experiment 1 was a TOJ task conducted in order to assess audiovisual simultaneity during perception of apparent motion. The results showed that the point of subjective simultaneity (PSS) was shifted toward a sound-lead stimulus, and the just noticeable difference (JND) was reduced compared with a normal TOJ task with a single flash. This indicates that visual apparent motion affects audiovisual simultaneity and improves temporal discrimination in audiovisual processing. Experiment 2 was a TOJ task conducted in order to remove the influence of the amount of flash stimulation from Experiment 1. The PSS and JND during perception of apparent motion were almost identical to those in Experiment 1, but differed from those for successive perception when long temporal intervals were included between two flashes without motion. This showed that the result obtained under the apparent motion condition was unaffected by the amount of flash stimulation. Because apparent motion was produced by a constant interval between two flashes, the results may be accounted for by specific prediction. In Experiment 3, we eliminated the influence of prediction by randomizing the intervals between the two flashes. However, the PSS and JND did not differ from those in Experiment 1. It became clear that the results obtained for the perception of visual apparent motion were not attributable to prediction. Our findings suggest that visual apparent motion changes temporal
Murakami, Takenobu; Restle, Julia; Ziemann, Ulf
A left-hemispheric cortico-cortical network involving areas of the temporoparietal junction (Tpj) and the posterior inferior frontal gyrus (pIFG) is thought to support sensorimotor integration of speech perception into articulatory motor activation, but how this network links with the lip area of the primary motor cortex (M1) during speech…
The current study investigated the impact of recasts together with form-focused instruction (FFI) on the development of second language speech perception and production of English /?/ by Japanese learners. Forty-five learners were randomly assigned to three groups--FFI recasts, FFI only, and Control--and exposed to four hours of communicatively…
Full Text Available This article discusses the connection of hemispheric control over audioverbal perception processes and such individual features as “leading hand” (right-handedness and lefthandedness. We present a literature review and description of our research to provide evidence of the complexity and ambiguity of this connection. The method of dichotic listening was used for diagnosing audioverbal perception lateralization. This method allows estimation of the right-ear coefficient (REC, the efficiency coefficient (EC, and the effectiveness ratio (ER of different aspects of audioverbal perception. Our research involved 47 persons with a leading right hand (mean age, 29.04±9.97 years and 32 persons with a leading left hand (mean age, 29.41±10.34 years. Different hypotheses about the mechanisms of hemispheric control over audioverbal and motor processes were assessed. The research showed that both the leftand right-handers’ audioverbal perception characteristics depended mainly on right-hemisphere activity. The most dynamic and sensitive index of the functioning of the two hemispheres during dichotic listening was the efficiency coefficient of stimuli reproduction through the left ear (EC of the left ear. It turns out that this index depends on the coincidence/noncoincidence of the leading hemispheres in speech and motor processes. The highest efficiency of audioverbal perception revealed itself in the left-handers with a leading left ear (the hemispheric-control coincidence, and the lowest efficiency was in the left-handers with a leading right ear (the hemispheric-control divergence. The right-handers were characterized by less variation in values, although the influence of the coincidence/noncoincidence of the leading hemispheres in speech and motor processes also revealed itself as a tendency. This consistent pattern points out the necessity for further research on asymmetries of the different modalities that takes into account their probable
Heikkinen, Jenna; Jansson-Verkasalo, Eira; Toivanen, Juhani; Suominen, Kalervo; Väyrynen, Eero; Moilanen, Irma; Seppänen, Tapio
Asperger's syndrome (AS) belongs to the group of autism spectrum disorders and is characterized by deficits in social interaction, as manifested e.g. by the lack of social or emotional reciprocity. The disturbance causes clinically significant impairment in social interaction. Abnormal prosody has been frequently identified as a core feature of AS. There are virtually no studies on recognition of basic emotions from speech. This study focuses on how adolescents with AS (n=12) and their typically developed controls (n=15) recognize the basic emotions happy, sad, angry, and 'neutral' from speech prosody. Adolescents with AS recognized basic emotions from speech prosody as well as their typically developed controls did. Possibly the recognition of basic emotions develops during the childhood.
Strelcyk, Olaf; Dau, Torsten
Hearing-impaired people often experience great difficulty with speech communication when background noise is present, even if reduced audibility has been compensated for. Other impairment factors must be involved. In order to minimize confounding effects, the subjects participating in this study...... consisted of groups with homogeneous, symmetric audiograms. The perceptual listening experiments assessed the intelligibility of full-spectrum as well as low-pass filtered speech in the presence of stationary and fluctuating interferers, the individual's frequency selectivity and the integrity of temporal...... modulation were obtained. In addition, these binaural and monaural thresholds were measured in a stationary background noise in order to assess the persistence of the fine-structure processing to interfering noise. Apart from elevated speech reception thresholds, the hearing impaired listeners showed poorer...
in regard to perceived level variation, loudness and overall acceptance. In the second experiment, two signals containing speech and noise at 75 dB SPL RMS-level, were compressed with six compression ratios from 1:1 to 10:1 and three release times from 40 ms to 4000 ms. In this experiment, subjects rated...... the signals in regard to loudness, speech clarity, noisiness and overall acceptance. Based on the results, a criterion for selecting compression parameters that yield some level-variation in the output signal, while still keeping the overall user-acceptance at a tolerable level, is suggested. It is also...... discussed how differences in speech and noise components seem to influence listeners ratings of the test signals. General recommendations for a fitting rule, that takes into account the spectral and temporal characteristics of the input signal, is given together with suggestions for further studies. Finally...
Full Text Available Here we use two filtered speech tasks to investigate children’s processing of slow (<4 Hz versus faster (~33 Hz temporal modulations in speech. We compare groups of children with either developmental dyslexia (Experiment 1 or speech and language impairments (SLIs, Experiment 2 to groups of typically-developing (TD children age-matched to each disorder group. Ten nursery rhymes were filtered so that their modulation frequencies were either low-pass filtered (< 4 Hz or band-pass filtered (22 – 40 Hz. Recognition of the filtered nursery rhymes was tested in a picture recognition multiple choice paradigm. Children with dyslexia aged 10 years showed equivalent recognition overall to TD controls for both the low-pass and band-pass filtered stimuli, but showed significantly impaired acoustic learning during the experiment from low-pass filtered targets. Children with oral speech and language impairments (SLIs aged 9 years showed significantly poorer recognition of band pass filtered targets compared to their TD controls, and showed comparable acoustic learning effects to TD children during the experiment. The SLI sample were also divided into children with and without phonological difficulties. The children with both SLI and phonological difficulties were impaired in recognising both kinds of filtered speech. These data are suggestive of impaired temporal sampling of the speech signal at different modulation rates by children with different kinds of developmental language disorder. Both SLI and dyslexic samples showed impaired discrimination of amplitude rise times. Implications of these findings for a temporal sampling framework for understanding developmental language disorders are discussed.
Weidema, Joey L; Roncaglia-Denissen, M P; Honing, Henkjan
Whether pitch in language and music is governed by domain-specific or domain-general cognitive mechanisms is contentiously debated. The aim of the present study was to investigate whether mechanisms governing pitch contour perception operate differently when pitch information is interpreted as either speech or music. By modulating listening mode, this study aspired to demonstrate that pitch contour perception relies on domain-specific cognitive mechanisms, which are regulated by top-down influences from language and music. Three groups of participants (Mandarin speakers, Dutch speaking non-musicians, and Dutch musicians) were exposed to identical pitch contours, and tested on their ability to identify these contours in a language and musical context. Stimuli consisted of disyllabic words spoken in Mandarin, and melodic tonal analogs, embedded in a linguistic and melodic carrier phrase, respectively. Participants classified identical pitch contours as significantly different depending on listening mode. Top-down influences from language appeared to alter the perception of pitch contour in speakers of Mandarin. This was not the case for non-musician speakers of Dutch. Moreover, this effect was lacking in Dutch speaking musicians. The classification patterns of pitch contours in language and music seem to suggest that domain-specific categorization is modulated by top-down influences from language and music.
Weidema, Joey L.; Roncaglia-Denissen, M. P.; Honing, Henkjan
Whether pitch in language and music is governed by domain-specific or domain-general cognitive mechanisms is contentiously debated. The aim of the present study was to investigate whether mechanisms governing pitch contour perception operate differently when pitch information is interpreted as either speech or music. By modulating listening mode, this study aspired to demonstrate that pitch contour perception relies on domain-specific cognitive mechanisms, which are regulated by top–down influences from language and music. Three groups of participants (Mandarin speakers, Dutch speaking non-musicians, and Dutch musicians) were exposed to identical pitch contours, and tested on their ability to identify these contours in a language and musical context. Stimuli consisted of disyllabic words spoken in Mandarin, and melodic tonal analogs, embedded in a linguistic and melodic carrier phrase, respectively. Participants classified identical pitch contours as significantly different depending on listening mode. Top–down influences from language appeared to alter the perception of pitch contour in speakers of Mandarin. This was not the case for non-musician speakers of Dutch. Moreover, this effect was lacking in Dutch speaking musicians. The classification patterns of pitch contours in language and music seem to suggest that domain-specific categorization is modulated by top–down influences from language and music. PMID:27313552
Joey L. Weidema
Full Text Available Whether pitch in language and music is governed by domain-specific or domain-general cognitive mechanisms is contentiously debated. The aim of the present study was to investigate whether mechanisms governing pitch contour perception operate differently when pitch information is interpreted as either speech or music. By modulating listening mode, this study aspired to demonstrate that pitch contour perception relies on domain-specific cognitive mechanisms, which are regulated by top-down influences from language and music. Three groups of participants (Mandarin speakers, Dutch speaking non-musicians, and Dutch musicians were exposed to identical pitch contours, and tested on their ability to identify these contours in a language and musical context. Stimuli consisted of disyllabic words spoken in Mandarin, and melodic tonal analogues, embedded in a linguistic and melodic carrier phrase, respectively. Participants classified identical pitch contours as significantly different depending on listening mode. Top-down influences from language appeared to alter the perception of pitch contour in speakers of Mandarin. This was not the case for non-musician speakers of Dutch. Moreover, this effect was lacking in Dutch speaking musicians. The classification patterns of pitch contours in language and music seem to suggest that domain-specific categorization is modulated by top-down influences from language and music.
Bione, Tiago; Grimshaw, Jennica; Cardoso, Walcir
As stated in Cardoso, Smith, and Garcia Fuentes (2015), second language researchers and practitioners have explored the pedagogical capabilities of Text-To-Speech synthesizers (TTS) for their potential to enhance the acquisition of writing (e.g. Kirstein, 2006), vocabulary and reading (e.g. Proctor, Dalton, & Grisham, 2007), and pronunciation…
McCarthy, Kathleen M.; Mahon, Merle; Rosen, Stuart; Evans, Bronwen G.
The majority of bilingual speech research has focused on simultaneous bilinguals. Yet, in immigrant communities, children are often initially exposed to their family language (L1), before becoming gradually immersed in the host country's language (L2). This is typically referred to as sequential bilingualism. Using a longitudinal design, this…
Full Text Available External degradations in incoming speech reduce understanding, and hearing impairment further compounds the problem. While cognitive mechanisms alleviate some of the difficulties, their effectiveness may change with age. In our research, reviewed here, we investigated cognitive compensation with hearing impairment, cochlear implants, and aging, via (a phonemic restoration as a measure of top-down filling of missing speech, (b listening effort and response times as a measure of increased cognitive processing, and (c visual world paradigm and eye gazing as a measure of the use of context and its time course. Our results indicate that between speech degradations and their cognitive compensation, there is a fine balance that seems to vary greatly across individuals. Hearing impairment or inadequate hearing device settings may limit compensation benefits. Cochlear implants seem to allow the effective use of sentential context, but likely at the cost of delayed processing. Linguistic and lexical knowledge, which play an important role in compensation, may be successfully employed in advanced age, as some compensatory mechanisms seem to be preserved. These findings indicate that cognitive compensation in hearing impairment can be highly complicated—not always absent, but also not easily predicted by speech intelligibility tests only.
Markham, Chris; Dean, Taraneh
The true impact of speech and language difficulties (SaLD) on children's lives and the effectiveness of intervention is unknown. Within other fields of paediatric healthcare, clinicians and policy-makers are increasingly emphasizing the utility of Health-Related Quality of Life (HRQoL) studies and measures. SaLT has a variety of measures to assess…
Mitterer, Holger; Kim, Sahyang; Cho, Taehong
In connected speech, phonological assimilation to neighboring words can lead to pronunciation variants (e.g., "garden bench" [arrow right] "garde'm' bench"). A large body of literature suggests that listeners use the phonetic context to reconstruct the intended word for assimilation types that often lead to incomplete assimilations (e.g., a…
Blood, Gordon W.; Blood, Ingrid M.; Coniglio, Amy D.; Finke, Erinn H.; Boyle, Michael P.
Children with autism spectrum disorders (ASD) are primary targets for bullies and victimization. Research shows school personnel may be uneducated about bullying and ways to intervene. Speech-language pathologists (SLPs) in schools often work with children with ASD and may have victims of bullying on their caseloads. These victims may feel most…
Clarke, Jeanne; Pals, Carina; Benard, Michel R.; Bhargava, Pranesh; Saija, Jefta; Sarampalis, Anastasios; Wagner, Anita; Gaudrain, Etienne
External degradations in incoming speech reduce understanding, and hearing impairment further compounds the problem. While cognitive mechanisms alleviate some of the difficulties, their effectiveness may change with age. In our research, reviewed here, we investigated cognitive compensation with hearing impairment, cochlear implants, and aging, via (a) phonemic restoration as a measure of top-down filling of missing speech, (b) listening effort and response times as a measure of increased cognitive processing, and (c) visual world paradigm and eye gazing as a measure of the use of context and its time course. Our results indicate that between speech degradations and their cognitive compensation, there is a fine balance that seems to vary greatly across individuals. Hearing impairment or inadequate hearing device settings may limit compensation benefits. Cochlear implants seem to allow the effective use of sentential context, but likely at the cost of delayed processing. Linguistic and lexical knowledge, which play an important role in compensation, may be successfully employed in advanced age, as some compensatory mechanisms seem to be preserved. These findings indicate that cognitive compensation in hearing impairment can be highly complicated—not always absent, but also not easily predicted by speech intelligibility tests only.
Full Text Available Natural sounds, including vocal communication sounds, contain critical information at multiple time scales. Two essential temporal modulation rates in speech have been argued to be in the low gamma band (~20-80 ms duration information and the theta band (~150-300 ms, corresponding to segmental and syllabic modulation rates, respectively. On one hypothesis, auditory cortex implements temporal integration using time constants closely related to these values. The neural correlates of a proposed dual temporal window mechanism in human auditory cortex remain poorly understood. We recorded MEG responses from participants listening to non-speech auditory stimuli with different temporal structures, created by concatenating frequency-modulated segments of varied segment durations. We show that these non-speech stimuli with temporal structure matching speech-relevant scales (~25 ms and ~200 ms elicit reliable phase tracking in the corresponding associated oscillatory frequencies (low gamma and theta bands. In contrast, stimuli with non-matching temporal structure do not. Furthermore, the topography of theta band phase tracking shows rightward lateralization while gamma band phase tracking occurs bilaterally. The results support the hypothesis that there exists multi-time resolution processing in cortex on discontinuous scales and provide evidence for an asymmetric organization of temporal analysis (asymmetrical sampling in time, AST. The data argue for a macroscopic-level neural mechanism underlying multi-time resolution processing: the sliding and resetting of intrinsic temporal windows on privileged time scales.
Pinget, A.C.H.; Bosker, H.R.; Quené, H.; de Jong, N.H.
Oral fluency and foreign accent distinguish L2 from L1 speech production. In language testing practices, both fluency and accent are usually assessed by raters. This study investigates what exactly native raters of fluency and accent take into account when judging L2. Our aim is to explore the relat
Gaskell, M. Gareth; Snoeren, Natalie D.
Models of compensation for phonological variation in spoken word recognition differ in their ability to accommodate complete assimilatory alternations (such as run assimilating fully to rum in the context of a quick run picks you up). Two experiments addressed whether such complete changes can be observed in casual speech, and if so, whether they…
Goswami, Usha; Cumming, Ruth; Chait, Maria; Huss, Martina; Mead, Natasha; Wilson, Angela M; Barnes, Lisa; Fosker, Tim
Here we use two filtered speech tasks to investigate children's processing of slow (children with either developmental dyslexia (Experiment 1) or speech and language impairments (SLIs, Experiment 2) to groups of typically-developing (TD) children age-matched to each disorder group. Ten nursery rhymes were filtered so that their modulation frequencies were either low-pass filtered (multiple choice paradigm. Children with dyslexia aged 10 years showed equivalent recognition overall to TD controls for both the low-pass and band-pass filtered stimuli, but showed significantly impaired acoustic learning during the experiment from low-pass filtered targets. Children with oral SLIs aged 9 years showed significantly poorer recognition of band pass filtered targets compared to their TD controls, and showed comparable acoustic learning effects to TD children during the experiment. The SLI samples were also divided into children with and without phonological difficulties. The children with both SLI and phonological difficulties were impaired in recognizing both kinds of filtered speech. These data are suggestive of impaired temporal sampling of the speech signal at different modulation rates by children with different kinds of developmental language disorder. Both SLI and dyslexic samples showed impaired discrimination of amplitude rise times. Implications of these findings for a temporal sampling framework for understanding developmental language disorders are discussed.
Usha eGoswami; Ruth eCumming; Maria eChait; Natasha eMead; Angela Marie Wilson; Lisa eBarnes; Tim eFosker
This is the final version of the article. It first appeared from Frontiers via http://dx.doi.org/10.3389/fpsyg.2016.00791 Here we use two filtered speech tasks to investigate children’s processing of slow (
Hazan, Valerie; Messaoud-Galusi, Souhila; Rosen, Stuart
Purpose: In this study, the authors aimed to determine whether children with dyslexia (hereafter referred to as "DYS children") are more affected than children with average reading ability (hereafter referred to as "AR children") by talker and intonation variability when perceiving speech in noise. Method: Thirty-four DYS and 25 AR children were…
Bryan, Karen; Gregory, Juliette
The purpose of this research was to ascertain the views of staff and managers within a youth offending team on their experiences of working with a speech and language therapist (SLT). The model of therapy provision was similar to the whole-systems approach used in schools. The impact of the service on language outcomes is reported elsewhere…
Petersen, Bjørn; Sørensen, Stine Derdau; Pedersen, Ellen Raben
their standard school schedule and received no music training. Before and after the intervention period, both groups completed a set of tests for perception of music, speech and emotional prosody. In addition, the participants filled out a questionnaire which examined music listening habits and enjoyment...... measures of rehabilitation are important throughout adolescence. Music training may provide a beneficial method of strengthening not only music perception, but also linguistic skills, particularly prosody. The purpose of this study was to examine perception of music and speech and music engagement...... of adolescent CI users and the potential effects of an intensive musical ear training program. METHODS Eleven adolescent CI users participated in a short intensive training program involving music making activities and computer based listening exercises. Ten NH agemates formed a reference group, who followed...
Full Text Available Noise-vocoding is a transformation which, when applied to speech, severely reduces spectral resolution and eliminates periodicity, yielding a stimulus that sounds like a harsh whisper (Scott, Blank et al. 2000. This process simulates a cochlear implant, where the activity of many thousand hair cells in the inner ear is replaced by direct stimulation of the auditory nerve by a small number of tonotopically-arranged electrodes. Although a cochlear implant offers a powerful means of restoring some degree of hearing to profoundly deaf individuals, the outcomes for spoken communication are highly variable (Moore and Shannon 2009. Some variability may arise from differences in peripheral representation (e.g. the degree of residual nerve survival but some may reflect differences in higher-order linguistic processing. In order to explore this possibility, we used noise-vocoding to explore speech recognition and perceptual learning in normal-hearing listeners tested across several levels of the linguistic hierarchy: segments (consonants and vowels, single words, and sentences. Listeners improved significantly on all tasks across two test sessions. In the first session, individual differences analyses revealed two independently varying sources of variability: one lexico-semantic in nature and implicating the recognition of words and sentences, and the other an acoustic-phonetic factor associated with words and segments. However, consequent to learning, by the second session there was a more uniform covariance pattern concerning all stimulus types. A further analysis of phonetic feature recognition allowed greater insight into learning-related changes in perception and showed that, surprisingly, participants did not make full use of cues that were preserved in the stimuli (e.g. vowel duration. We discuss these findings in relation cochlear implantation, and suggest auditory training strategies to maximise speech recognition performance in the absence of
Giraud, Anne-Lise; Kleinschmidt, Andreas; Poeppel, David
Across multiple timescales, acoustic regularities of speech match rhythmic properties of both the auditory and motor systems. Syllabic rate corresponds to natural jaw-associated oscillatory rhythms, and phonemic length could reflect endogenous oscillatory auditory cortical properties. Hemispheric...... that spontaneous EEG power variations within the gamma range (phonemic rate) correlate best with left auditory cortical synaptic activity, while fluctuations within the theta range correlate best with that in the right. Power fluctuations in both ranges correlate with activity in the mouth premotor region...
Full Text Available OBJECTIVE: To investigate the performance of monaural and binaural beamforming technology with an additional noise reduction algorithm, in cochlear implant recipients. METHOD: This experimental study was conducted as a single subject repeated measures design within a large German cochlear implant centre. Twelve experienced users of an Advanced Bionics HiRes90K or CII implant with a Harmony speech processor were enrolled. The cochlear implant processor of each subject was connected to one of two bilaterally placed state-of-the-art hearing aids (Phonak Ambra providing three alternative directional processing options: an omnidirectional setting, an adaptive monaural beamformer, and a binaural beamformer. A further noise reduction algorithm (ClearVoice was applied to the signal on the cochlear implant processor itself. The speech signal was presented from 0° and speech shaped noise presented from loudspeakers placed at ±70°, ±135° and 180°. The Oldenburg sentence test was used to determine the signal-to-noise ratio at which subjects scored 50% correct. RESULTS: Both the adaptive and binaural beamformer were significantly better than the omnidirectional condition (5.3 dB±1.2 dB and 7.1 dB±1.6 dB (p<0.001 respectively. The best score was achieved with the binaural beamformer in combination with the ClearVoice noise reduction algorithm, with a significant improvement in SRT of 7.9 dB±2.4 dB (p<0.001 over the omnidirectional alone condition. CONCLUSIONS: The study showed that the binaural beamformer implemented in the Phonak Ambra hearing aid could be used in conjunction with a Harmony speech processor to produce substantial average improvements in SRT of 7.1 dB. The monaural, adaptive beamformer provided an averaged SRT improvement of 5.3 dB.
Full Text Available Many figurative expressions are fully conventionalized in everyday speech. Regarding the neural basis of figurative language processing, research has predominantly focused on metaphoric expressions in minimal semantic context. It remains unclear in how far metaphoric expressions during continuous text comprehension activate similar neural networks as isolated metaphors. We therefore investigated the processing of similes (figurative language, e.g. He smokes like a chimney! occurring in a short story.Sixteen healthy, male, native German speakers listened to similes that came about naturally in a short story, while blood-oxygenation-level-dependent (BOLD responses were measured with functional magnetic resonance imaging (fMRI. For the event-related analysis, similes were contrasted with non-figurative control sentences. The stimuli differed with respect to figurativeness, while they were matched for frequency of words, number of syllables, plausibility and comprehensibility.Similes contrasted with control sentences resulted in enhanced BOLD responses in the left inferior (IFG and adjacent middle frontal gyrus. Concrete control sentences as compared to similes activated the bilateral middle temporal gyri as well as the right precuneus and the left middle frontal gyrus.Activation of the left IFG for similes in a short story is consistent with results on single sentence metaphor processing. The findings strengthen the importance of the left inferior frontal region in the processing of abstract figurative speech during continuous, ecologically-valid speech comprehension; the processing of concrete semantic contents goes along with a down-regulation of bilateral temporal regions.
Lewkowicz, David J.; Flom, Ross
Binding is key in multisensory perception. This study investigated the audio-visual (A-V) temporal binding window in 4-, 5-, and 6-year-old children (total N = 120). Children watched a person uttering a syllable whose auditory and visual components were either temporally synchronized or desynchronized by 366, 500, or 666 ms. They were asked…
Full Text Available Arjun SinghDepartment of Pathology, Sri Venkateshwara Medical College Hospital and Research Centre, Pondicherry, IndiaPurpose: We use different methods to train our undergraduates. The patient-oriented problem-solving (POPS system is an innovative teaching–learning method that imparts knowledge, enhances intrinsic motivation, promotes self learning, encourages clinical reasoning, and develops long-lasting memory. The aim of this study was to develop POPS in teaching pathology, assess its effectiveness, and assess students’ preference for POPS over didactic lectures.Method: One hundred fifty second-year MBBS students were divided into two groups: A and B. Group A was taught by POPS while group B was taught by traditional lectures. Pre- and post-test numerical scores of both groups were evaluated and compared. Students then completed a self-structured feedback questionnaire for analysis.Results: The mean (SD difference in pre- and post-test scores of groups A and B was 15.98 (3.18 and 7.79 (2.52, respectively. The significance of the difference between scores of group A and group B teaching methods was 16.62 (P < 0.0001, as determined by the z-test. Improvement in post-test performance of group A was significantly greater than of group B, demonstrating the effectiveness of POPS. Students responded that POPS facilitates self-learning, helps in understanding topics, creates interest, and is a scientific approach to teaching. Feedback response on POPS was strong in 57.52% of students, moderate in 35.67%, and negative in only 6.81%, showing that 93.19% students favored POPS over simple lectures.Conclusion: It is not feasible to enforce the PBL method of teaching throughout the entire curriculum; However, POPS can be incorporated along with audiovisual aids to break the monotony of dialectic lectures and as alternative to PBL.Keywords: medical education, problem-solving exercise, problem-based learning
Ogawa, Akitoshi; Bordier, Cecile; Macaluso, Emiliano
The use of naturalistic stimuli to probe sensory functions in the human brain is gaining increasing interest. Previous imaging studies examined brain activity associated with the processing of cinematographic material using both standard "condition-based" designs, as well as "computational" methods based on the extraction of time-varying features of the stimuli (e.g. motion). Here, we exploited both approaches to investigate the neural correlates of complex visual and auditory spatial signals in cinematography. In the first experiment, the participants watched a piece of a commercial movie presented in four blocked conditions: 3D vision with surround sounds (3D-Surround), 3D with monaural sound (3D-Mono), 2D-Surround, and 2D-Mono. In the second experiment, they watched two different segments of the movie both presented continuously in 3D-Surround. The blocked presentation served for standard condition-based analyses, while all datasets were submitted to computation-based analyses. The latter assessed where activity co-varied with visual disparity signals and the complexity of auditory multi-sources signals. The blocked analyses associated 3D viewing with the activation of the dorsal and lateral occipital cortex and superior parietal lobule, while the surround sounds activated the superior and middle temporal gyri (S/MTG). The computation-based analyses revealed the effects of absolute disparity in dorsal occipital and posterior parietal cortices and of disparity gradients in the posterior middle temporal gyrus plus the inferior frontal gyrus. The complexity of the surround sounds was associated with activity in specific sub-regions of S/MTG, even after accounting for changes of sound intensity. These results demonstrate that the processing of naturalistic audio-visual signals entails an extensive set of visual and auditory areas, and that computation-based analyses can track the contribution of complex spatial aspects characterizing such life-like stimuli.
Full Text Available The use of naturalistic stimuli to probe sensory functions in the human brain is gaining increasing interest. Previous imaging studies examined brain activity associated with the processing of cinematographic material using both standard "condition-based" designs, as well as "computational" methods based on the extraction of time-varying features of the stimuli (e.g. motion. Here, we exploited both approaches to investigate the neural correlates of complex visual and auditory spatial signals in cinematography. In the first experiment, the participants watched a piece of a commercial movie presented in four blocked conditions: 3D vision with surround sounds (3D-Surround, 3D with monaural sound (3D-Mono, 2D-Surround, and 2D-Mono. In the second experiment, they watched two different segments of the movie both presented continuously in 3D-Surround. The blocked presentation served for standard condition-based analyses, while all datasets were submitted to computation-based analyses. The latter assessed where activity co-varied with visual disparity signals and the complexity of auditory multi-sources signals. The blocked analyses associated 3D viewing with the activation of the dorsal and lateral occipital cortex and superior parietal lobule, while the surround sounds activated the superior and middle temporal gyri (S/MTG. The computation-based analyses revealed the effects of absolute disparity in dorsal occipital and posterior parietal cortices and of disparity gradients in the posterior middle temporal gyrus plus the inferior frontal gyrus. The complexity of the surround sounds was associated with activity in specific sub-regions of S/MTG, even after accounting for changes of sound intensity. These results demonstrate that the processing of naturalistic audio-visual signals entails an extensive set of visual and auditory areas, and that computation-based analyses can track the contribution of complex spatial aspects characterizing such life
Møller, Anders Kalsgaard; Hoffmann, Pablo F.; Carrozzino, Marcello
The state-of-the-art speech intelligibility tests are created with the purpose of evaluating acoustic communication devices and not for evaluating audio-visual virtual reality systems. This paper present a novel method to evaluate a communication situation based on both the speech intelligibility...... and the indexical characteristics of the speaker. The results will be available in the final paper. Index Terms: speech intelligibility , virtual reality, body language, telecommunication.......The state-of-the-art speech intelligibility tests are created with the purpose of evaluating acoustic communication devices and not for evaluating audio-visual virtual reality systems. This paper present a novel method to evaluate a communication situation based on both the speech intelligibility...
Noordenbos, M. W.; Segers, E.; Serniclaes, W.; Mitterer, H.; Verhoeven, L.
There is ample evidence that individuals with dyslexia have a phonological deficit. A growing body of research also suggests that individuals with dyslexia have problems with categorical perception, as evidenced by weaker discrimination of between-category differences and better discrimination of within-category differences compared to average…
Noordenbos, M.W.; Segers, P.C.J.; Serniclaes, W.; Mitterer, H.A.; Verhoeven, L.T.W.
There is ample evidence that individuals with dyslexia have a phonological deficit. A growing body of research also suggests that individuals with dyslexia have problems with categorical perception, as evidenced by weaker discrimination of between-category differences and better discrimination of wi
Kakouros, Sofoklis; Räsänen, Okko
Numerous studies have examined the acoustic correlates of sentential stress and its underlying linguistic functionality. However, the mechanism that connects stress cues to the listener's attentional processing has remained unclear. Also, the learnability versus innateness of stress perception has not been widely discussed. In this work, we…
Reed, Charlotte M.; And Others
Small-set segmental identification experiments were conducted with three deaf-blind subjects who were highly experienced users of the Tadoma method. Systematic variations in the positioning of the hand on the speaker's face for Tadoma produced systematic effects on percent-correct scores, information transfer, and perception of individual…
Wellington da SILVA
Full Text Available ABSTRACT This study was conducted to investigate whether the listeners' culture and mother language influence the perception of emotions through speech and which acoustic cues listeners use in this process. Swedish and Brazilian listeners were presented with authentic emotional speech samples of Brazilian Portuguese and Swedish. They judged on 5-point Likert scales the expression of basic emotions as described by eight adjectives in the utterances in Brazilian Portuguese and the expression of five emotional dimensions in the utterances in Swedish. The PCA technique revealed that two components explain more than 94% of the variance of the judges' responses in both experiments. These components were predicted through multiple linear regressions from twelve acoustic parameters automatically computed from the utterances. The results point to a similar perception of the emotions between both cultures.
MacSweeney, Mairéad; Woll, Bencie; Campbell, Ruth; McGuire, Philip K; David, Anthony S; Williams, Steven C R; Suckling, John; Calvert, Gemma A; Brammer, Michael J
In order to understand the evolution of human language, it is necessary to explore the neural systems that support language processing in its many forms. In particular, it is informative to separate those mechanisms that may have evolved for sensory processing (hearing) from those that have evolved to represent events and actions symbolically (language). To what extent are the brain systems that support language processing shaped by auditory experience and to what extent by exposure to language, which may not necessarily be acoustically structured? In this first neuroimaging study of the perception of British Sign Language (BSL), we explored these questions by measuring brain activation using functional MRI in nine hearing and nine congenitally deaf native users of BSL while they performed a BSL sentence-acceptability task. Eight hearing, non-signing subjects performed an analogous task that involved audio-visual English sentences. The data support the argument that there are both modality-independent and modality-dependent language localization patterns in native users. In relation to modality-independent patterns, regions activated by both BSL in deaf signers and by spoken English in hearing non-signers included inferior prefrontal regions bilaterally (including Broca's area) and superior temporal regions bilaterally (including Wernicke's area). Lateralization patterns were similar for the two languages. There was no evidence of enhanced right-hemisphere recruitment for BSL processing in comparison with audio-visual English. In relation to modality-specific patterns, audio-visual speech in hearing subjects generated greater activation in the primary and secondary auditory cortices than BSL in deaf signers, whereas BSL generated enhanced activation in the posterior occipito-temporal regions (V5), reflecting the greater movement component of BSL. The influence of hearing status on the recruitment of sign language processing systems was explored by comparing deaf
David L Woods
Full Text Available Hearing aids (HAs only partially restore the ability of older hearing impaired (OHI listeners to understand speech in noise, due in large part to persistent deficits in consonant identification. Here, we investigated whether adaptive perceptual training would improve consonant-identification in noise in sixteen aided OHI listeners who underwent 40 hours of computer-based training in their homes. Listeners identified 20 onset and 20 coda consonants in 9,600 consonant-vowel-consonant (CVC syllables containing different vowels (/ɑ/, /i/, or /u/ and spoken by four different talkers. Consonants were presented at three consonant-specific signal-to-noise ratios (SNRs spanning a 12 dB range. Noise levels were adjusted over training sessions based on d' measures. Listeners were tested before and after training to measure (1 changes in consonant-identification thresholds using syllables spoken by familiar and unfamiliar talkers, and (2 sentence reception thresholds (SeRTs using two different sentence tests. Consonant-identification thresholds improved gradually during training. Laboratory tests of d' thresholds showed an average improvement of 9.1 dB, with 94% of listeners showing statistically significant training benefit. Training normalized consonant confusions and improved the thresholds of some consonants into the normal range. Benefits were equivalent for onset and coda consonants, syllables containing different vowels, and syllables presented at different SNRs. Greater training benefits were found for hard-to-identify consonants and for consonants spoken by familiar than unfamiliar talkers. SeRTs, tested with simple sentences, showed less elevation than consonant-identification thresholds prior to training and failed to show significant training benefit, although SeRT improvements did correlate with improvements in consonant thresholds. We argue that the lack of SeRT improvement reflects the dominant role of top-down semantic processing in
Fridriksson, Julius; Hubbard, H Isabel; Hudspeth, Sarah Grace; Holland, Audrey L; Bonilha, Leonardo; Fromm, Davida; Rorden, Chris
A distinguishing feature of Broca's aphasia is non-fluent halting speech typically involving one to three words per utterance. Yet, despite such profound impairments, some patients can mimic audio-visual speech stimuli enabling them to produce fluent speech in real time. We call this effect 'speech entrainment' and reveal its neural mechanism as well as explore its usefulness as a treatment for speech production in Broca's aphasia. In Experiment 1, 13 patients with Broca's aphasia were tested in three conditions: (i) speech entrainment with audio-visual feedback where they attempted to mimic a speaker whose mouth was seen on an iPod screen; (ii) speech entrainment with audio-only feedback where patients mimicked heard speech; and (iii) spontaneous speech where patients spoke freely about assigned topics. The patients produced a greater variety of words using audio-visual feedback compared with audio-only feedback and spontaneous speech. No difference was found between audio-only feedback and spontaneous speech. In Experiment 2, 10 of the 13 patients included in Experiment 1 and 20 control subjects underwent functional magnetic resonance imaging to determine the neural mechanism that supports speech entrainment. Group results with patients and controls revealed greater bilateral cortical activation for speech produced during speech entrainment compared with spontaneous speech at the junction of the anterior insula and Brodmann area 47, in Brodmann area 37, and unilaterally in the left middle temporal gyrus and the dorsal portion of Broca's area. Probabilistic white matter tracts constructed for these regions in the normal subjects revealed a structural network connected via the corpus callosum and ventral fibres through the extreme capsule. Unilateral areas were connected via the arcuate fasciculus. In Experiment 3, all patients included in Experiment 1 participated in a 6-week treatment phase using speech entrainment to improve speech production. Behavioural and
Andersen, Tobias; Mamassian, Pascal
leaving only unsigned stimulus transients as the basis for audiovisual integration. Facilitation of luminance detection occurred even with varying audiovisual stimulus onset asynchrony and even when the sound lagged behind the luminance change by 75 ms supporting the interpretation that perceptual...
Babin, Pierre, Ed.
A series of twelve essays discuss the use of audiovisuals in religious education. The essays are divided into three sections: one which draws on the ideas of Marshall McLuhan and other educators to explore the newest ideas about audiovisual language and faith, one that describes how to learn and use the new language of audio and visual images, and…
Manan, Hanani Abdul; Franz, Elizabeth A; Yusoff, Ahmad Nazlim; Mukari, Siti Zamratol-Mai Sarah
In the present study, brain activation associated with speech perception processing was examined across four groups of adult participants with age ranges between 20 and 65 years, using functional MRI (fMRI). Cognitive performance demonstrates that performance accuracy declines with age. fMRI results reveal that all four groups of participants activated the same brain areas. The same brain activation pattern was found in all activated areas (except for the right superior temporal gyrus and right middle temporal gyrus); brain activity was increased from group 1 (20-29 years) to group 2 (30-39 years). However, it decreased in group 3 (40-49 years) with further decreases in group 4 participants (50-65 years). Result also reveals that three brain areas (superior temporal gyrus, Heschl's gyrus and cerebellum) showed changes in brain laterality in the older participants, akin to a shift from left-lateralized to right-lateralized activity. The onset of this change was different across brain areas. Based on these findings we suggest that, whereas all four groups of participants used the same areas in processing, the engagement and recruitment of those areas differ with age as the brain grows older. Findings are discussed in the context of corroborating evidence of neural changes with age.
Mentovich, Avital; Huq, Aziz; Cerf, Moran
The U.S. Supreme Court has increasingly expanded the scope of constitutional rights granted to corporations and other collective entities. Although this tendency receives widespread public and media attention, little empirical research examines how people ascribe rights, commonly thought to belong to natural persons, to corporations. This article explores this issue in 3 studies focusing on different rights (religious liberty, privacy, and free speech). We examined participants' willingness to grant a given right while manipulating the type of entity at stake (from small businesses, to larger corporations, to for-profit and nonprofit companies), and the identity of the right holder (from employees, to owners, to the company itself as a separate entity). We further examined the role of political ideology in perceptions of rights. Results indicated a significant decline in the degree of recognition of entities' rights (the company itself) in comparison to natural persons' rights (owners and employees). Results also demonstrated an effect of the type of entity at stake: Larger, for-profit businesses were less likely to be viewed as rights holders compared with nonprofit entities. Although both tendencies persisted across the ideological spectrum, ideological differences emerged in the relations between corporate and individual rights: these were positively related among conservatives but negatively related among liberals. Finally, we found that the desire to protect citizens (compared with businesses) underlies individuals' willingness to grant rights to companies. These findings show that people (rather than corporations) are more appropriate recipients of rights, and can explain public backlash to judicial expansions of corporate rights.
Full Text Available This article draws a perceptual approach to audio-visual mapping. Clearly perceivable cause and effect relationships can be problematic if one desires the audience to experience the music. Indeed perception would bias those sonic qualities that fit previous concepts of causation, subordinating other sonic qualities, which may form the relations between the sounds themselves. The question is, how can an audio-visual mapping produce a sense of causation, and simultaneously confound the actual cause-effect relationships. We call this a fungible audio-visual mapping. Our aim here is to glean its constitution and aspect. We will report a study, which draws upon methods from experimental psychology to inform audio-visual instrument design and composition. The participants are shown several audio-visual mapping prototypes, after which we pose quantitative and qualitative questions regarding their sense of causation, and their sense of understanding the cause-effect relationships. The study shows that a fungible mapping requires both synchronized and seemingly non-related components – sufficient complexity to be confusing. As the specific cause-effect concepts remain inconclusive, the sense of causation embraces the whole.
Full Text Available This work analyzed the perceptual attributes of natural dynamic audiovisual scenes. We presented thirty participants with 19 natural scenes in a similarity categorization task, followed by a semi-structured interview. The scenes were reproduced with an immersive audiovisual display. Natural scene perception has been studied mainly with unimodal settings, which have identified motion as one of the most salient attributes related to visual scenes, and sound intensity along with pitch trajectories related to auditory scenes. However, controlled laboratory experiments with natural multimodal stimuli are still scarce. Our results show that humans pay attention to similar perceptual attributes in natural scenes, and a two-dimensional perceptual map of the stimulus scenes and perceptual attributes was obtained in this work. The exploratory results show the amount of movement, perceived noisiness, and eventfulness of the scene to be the most important perceptual attributes in naturalistically reproduced real-world urban environments. We found the scene gist properties openness and expansion to remain as important factors in scenes with no salient auditory or visual events. We propose that the study of scene perception should move forward to understand better the processes behind multimodal scene processing in real-world environments. We publish our stimulus scenes as spherical video recordings and sound field recordings in a publicly available database.
Helfer, Karen S.
The present study examined the interactions of age, temporal resolution, hearing loss, and consonant perception under realistic listening conditions. Four subject groups were employed, with N = 8 per group: young adults with normal hearing or sloping sensorineural hearing loss; and elderly individuals with minimal peripheral hearing loss or presbycusis. Copies of the CUNY Nonsense Syllable Test (NST) were re-recorded under four levels of reverberation in quiet and in +10 signal-to-noise ratio of cafeteria noise. The test stimuli were presented binaurally to the subjects via TDH-49 headphones. In addition, a diotic wideband gap detection task, using a 72 dBSPL presentation level, was used as a measure of temporal resolution. Results of the present investigation demonstrated that all other subject groups performed significantly poorer than the young, normal hearing adults. Scores decreased as the amount of distortion increased, although there was very little difference between performance in the two highest reverberation conditions. For reverberation alone, the young, hearing-impaired listeners obtained the lowest scores; for reverberation + noise, the older, hearing-impaired subjects performed poorest. Large within-group variability was noted in the size of the gap thresholds, and only the two groups of young subjects differed significantly on this task of temporal resolution. Correlation analyses demonstrated a strong inverse relation between age and performance in reverberation + noise, even when degree of hearing loss was partialed out. Pure -tone average, as well as NST scores in quiet and in noise alone were related strongly to performance in reverberation. Multiple regression analyses demonstrated that nonsense syllable perception in reverberation could be predicted from a combination of age, pure-tone thresholds, scores in quiet and noise, and gap threshold. Gap threshold was the strongest predictor variable for reverberation alone, while the NST score
LIU Peng; WANG Zuoying
In this paper we address the problem of audio-visual speech recognition in the framework of the multi-stream hidden Markov model. Stream weight training based on minimum classification error criterion is discussed for use in large vocabulary continuous speech recognition (LVCSR). We present the lattice re-scoring and Viterbi approaches for calculating the loss function of continuous speech. The experimental results show that in the case of clean audio, the system performance can be improved by 36.1% in relative word error rate reduction when using state-based stream weights trained by a Viterbi approach, compared to an audio only speech recognition system. Further experimental results demonstrate that our audio-visual LVCSR system provides significant enhancement of robustness in noisy environments.
Hertrich, Ingo; Dietrich, Susanne; Ackermann, Hermann
Blind people can learn to understand speech at ultra-high syllable rates (ca. 20 syllables/s), a capability associated with hemodynamic activation of the central-visual system. To further elucidate the neural mechanisms underlying this skill, magnetoencephalographic (MEG) measurements during listening to sentence utterances were cross-correlated with time courses derived from the speech signal (envelope, syllable onsets and pitch periodicity) to capture phase-locked MEG components (14 blind, 12 sighted subjects; speech rate=8 or 16 syllables/s, pre-defined source regions: auditory and visual cortex, inferior frontal gyrus). Blind individuals showed stronger phase locking in auditory cortex than sighted controls, and right-hemisphere visual cortex activity correlated with syllable onsets in case of ultra-fast speech. Furthermore, inferior-frontal MEG components time-locked to pitch periodicity displayed opposite lateralization effects in sighted (towards right hemisphere) and blind subjects (left). Thus, ultra-fast speech comprehension in blind individuals appears associated with changes in early signal-related processing mechanisms both within and outside the central-auditory terrain.
Riedel, Philipp; Ragert, Patrick; Schelinski, Stefanie; Kiebel, Stefan J; von Kriegstein, Katharina
with the 'auditory-visual view' of auditory speech perception, which assumes that auditory speech recognition is optimized by using predictions from previously encoded speaker-specific audio-visual internal models.
Tamminen, Henna; Peltola, Maija S; Kujala, Teija; Näätänen, Risto
Language-specific, automatically responding memory traces form the basis for speech sound perception and new neural representations can also evolve for non-native speech categories. The aim of this study was to find out how a three-day phonetic listen-and-repeat training affects speech perception, and whether it generates new memory traces. We used behavioural identification, goodness rating, discrimination, and reaction time tasks together with mismatch negativity (MMN) brain response registrations to determine the training effects on native Finnish speakers. We trained the subjects the voicing contrast in fricative sounds. Fricatives are not differentiated by voicing in Finnish, i.e., voiced fricatives do not belong to the Finnish phonological system. Therefore, they are extremely hard for Finns to learn. However, only after three days of training, the native Finnish subjects had learned to perceive the distinction. The results show striking changes in the MMN response; it was significantly larger on the second day after two training sessions. Also, the majority of the behavioural indicators showed improvement during training. Identification altered after four sessions of training and discrimination and reaction times improved throughout training. These results suggest remarkable language-learning effects both at the perceptual and pre-attentive neural level as a result of brief listen-and-repeat training in adult participants.
Full Text Available Though human sensations, such as the senses of hearing, sight, etc., are independent each other, the interference between two of them is sometimes observed, and is called cross-modal perception. Hitherto we studied unimodal perception of visual sensation and auditory sensation respectively by differential geometry. We interpreted the parallel alley and the distance alley as two geodesics under different conditions in a visual space, and depicted the trace of continuous vowel speech as the geodesics through phonemes on a vowel plane. In this work, cross-modal perception is similarly treated from the standpoint of non-Riemannian geometry, where each axis of a cross-modal sensory space represents unimodal sensation. The geometry allows us to treat asymmetric metric tensor and hence a non-Euclidean concept of anholonomic objects, representing unidirectional property of cross-modal perception. The McGurk effect in audiovisual perception and ‘rubber hand’ illusion in visual tactile perception can afford experimental evidence of torsion tensor. The origin of ‘bouncing balls’ illusion is discussed from the standpoint of an audiovisual cross-modal sensory space in a qualitative manner.
Full Text Available Evidence of visual-auditory cross-modal plasticity in deaf individuals has been widely reported. Superior visual abilities of deaf individuals have been shown to result in enhanced reactivity to visual events and/or enhanced peripheral spatial attention. The goal of this study was to investigate the association between visual-auditory cross-modal plasticity and speech perception in post-lingually deafened, adult cochlear implant (CI users. Post-lingually deafened adults with CIs (N = 14 and a group of normal hearing, adult controls (N = 12 participated in this study. The CI participants were divided into a good performer group (good CI, N = 7 and a poor performer group (poor CI, N = 7 based on word recognition scores. Visual evoked potentials (VEP were recorded from the temporal and occipital cortex to assess reactivity. Visual field (VF testing was used to assess spatial attention and Goldmann perimetry measures were analyzed to identify differences across groups in the VF. The association of the amplitude of the P1 VEP response over the right temporal or occipital cortex among three groups (control, good CI, poor CI was analyzed. In addition, the association between VF by different stimuli and word perception score was evaluated. The P1 VEP amplitude recorded from the right temporal cortex was larger in the group of poorly performing CI users than the group of good performers. The P1 amplitude recorded from electrodes near the occipital cortex was smaller for the poor performing group. P1 VEP amplitude in right temporal lobe was negatively correlated with speech perception outcomes for the CI participants (r = -0.736, P = 0.003. However, P1 VEP amplitude measures recorded from near the occipital cortex had a positive correlation with speech perception outcome in the CI participants (r = 0.775, P = 0.001. In VF analysis, CI users showed narrowed central VF (VF to low intensity stimuli. However, their far peripheral VF (VF to high intensity
Robins, Diana L.; Hunyadi, Elinora; Schultz, Robert T.
Perception of emotion is critical for successful social interaction, yet the neural mechanisms underlying the perception of dynamic, audio-visual emotional cues are poorly understood. Evidence from language and sensory paradigms suggests that the superior temporal sulcus and gyrus (STS/STG) play a key role in the integration of auditory and visual…
Rousseau, Isabelle; Onslow, Mark; Packman, Ann; Jones, Mark
Purpose: To determine whether measures of stuttering frequency and measures of overall stuttering severity in preschoolers differ when made from audio-only recordings compared with audiovisual recordings. Method: Four blinded speech-language pathologists who had extensive experience with preschoolers who stutter measured stuttering frequency and…
Full Text Available There is a wide range of acoustic and visual variability across different talkers and different speaking contexts. Listeners with normal hearing accommodate that variability in ways that facilitate efficient perception, but it is not known whether listeners with cochlear implants can do the same. In this study, listeners with normal hearing (NH and listeners with cochlear implants (CIs were tested for accommodation to auditory and visual phonetic contexts created by gender-driven speech differences as well as vowel coarticulation and lip rounding in both consonants and vowels. Accommodation was measured as the shifting of perceptual boundaries between /s/ and /ʃ/ sounds in various contexts, as modeled by mixed-effects logistic regression. Owing to the spectral contrasts thought to underlie these context effects, CI listeners were predicted to perform poorly, but showed considerable success. Listeners with cochlear implants not only showed sensitivity to auditory cues to gender, they were also able to use visual cues to gender (i.e. faces as a supplement or proxy for information in the acoustic domain, in a pattern that was not observed for listeners with normal hearing. Spectrally-degraded stimuli heard by listeners with normal hearing generally did not elicit strong context effects, underscoring the limitations of noise vocoders and/or the importance of experience with electric hearing. Visual cues for consonant lip rounding and vowel lip rounding were perceived in a manner consistent with coarticulation and were generally used more heavily by listeners with CIs. Results suggest that listeners with cochlear implants are able to accommodate various sources of acoustic variability either by attending to appropriate acoustic cues or by inferring them via the visual signal.
Albouy, Philippe; Lévêque, Yohana; Hyde, Krista L; Bouchet, Patrick; Tillmann, Barbara; Caclin, Anne
The combination of information across senses can enhance perception, as revealed for example by decreased reaction times or improved stimulus detection. Interestingly, these facilitatory effects have been shown to be maximal when responses to unisensory modalities are weak. The present study investigated whether audiovisual facilitation can be observed in congenital amusia, a music-specific disorder primarily ascribed to impairments of pitch processing. Amusic individuals and their matched controls performed two tasks. In Task 1, they were required to detect auditory, visual, or audiovisual stimuli as rapidly as possible. In Task 2, they were required to detect as accurately and as rapidly as possible a pitch change within an otherwise monotonic 5-tone sequence that was presented either only auditorily (A condition), or simultaneously with a temporally congruent, but otherwise uninformative visual stimulus (AV condition). Results of Task 1 showed that amusics exhibit typical auditory and visual detection, and typical audiovisual integration capacities: both amusics and controls exhibited shorter response times for audiovisual stimuli than for either auditory stimuli or visual stimuli. Results of Task 2 revealed that both groups benefited from simultaneous uninformative visual stimuli to detect pitch changes: accuracy was higher and response times shorter in the AV condition than in the A condition. The audiovisual improvements of response times were observed for different pitch interval sizes depending on the group. These results suggest that both typical listeners and amusic individuals can benefit from multisensory integration to improve their pitch processing abilities and that this benefit varies as a function of task difficulty. These findings constitute the first step towards the perspective to exploit multisensory paradigms to reduce pitch-related deficits in congenital amusia, notably by suggesting that audiovisual paradigms are effective in an appropriate
Desantis, Andrea; Haggard, Patrick
To form a coherent representation of the objects around us, the brain must group the different sensory features composing these objects. Here, we investigated whether actions contribute in this grouping process. In particular, we assessed whether action-outcome learning and prediction contribute to audiovisual temporal binding. Participants were presented with two audiovisual pairs: one pair was triggered by a left action, and the other by a right action. In a later test phase, the audio and visual components of these pairs were presented at different onset times. Participants judged whether they were simultaneous or not. To assess the role of action-outcome prediction on audiovisual simultaneity, each action triggered either the same audiovisual pair as in the learning phase ('predicted' pair), or the pair that had previously been associated with the other action ('unpredicted' pair). We found the time window within which auditory and visual events appeared simultaneous increased for predicted compared to unpredicted pairs. However, no change in audiovisual simultaneity was observed when audiovisual pairs followed visual cues, rather than voluntary actions. This suggests that only action-outcome learning promotes temporal grouping of audio and visual effects. In a second experiment we observed that changes in audiovisual simultaneity do not only depend on our ability to predict what outcomes our actions generate, but also on learning the delay between the action and the multisensory outcome. When participants learned that the delay between action and audiovisual pair was variable, the window of audiovisual simultaneity for predicted pairs increased, relative to a fixed action-outcome pair delay. This suggests that participants learn action-based predictions of audiovisual outcome, and adapt their temporal perception of outcome events based on such predictions.
Reetzke, Rachel; Lam, Boji Pak-Wing; Xie, Zilong; Sheng, Li; Chandrasekaran, Bharath
Recognizing speech in adverse listening conditions is a significant cognitive, perceptual, and linguistic challenge, especially for children. Prior studies have yielded mixed results on the impact of bilingualism on speech perception in noise. Methodological variations across studies make it difficult to converge on a conclusion regarding the effect of bilingualism on speech-in-noise performance. Moreover, there is a dearth of speech-in-noise evidence for bilingual children who learn two languages simultaneously. The aim of the present study was to examine the extent to which various adverse listening conditions modulate differences in speech-in-noise performance between monolingual and simultaneous bilingual children. To that end, sentence recognition was assessed in twenty-four school-aged children (12 monolinguals; 12 simultaneous bilinguals, age of English acquisition ≤ 3 yrs.). We implemented a comprehensive speech-in-noise battery to examine recognition of English sentences across different modalities (audio-only, audiovisual), masker types (steady-state pink noise, two-talker babble), and a range of signal-to-noise ratios (SNRs; 0 to -16 dB). Results revealed no difference in performance between monolingual and simultaneous bilingual children across each combination of modality, masker, and SNR. Our findings suggest that when English age of acquisition and socioeconomic status is similar between groups, monolingual and bilingual children exhibit comparable speech-in-noise performance across a range of conditions analogous to everyday listening environments. PMID:27936212
Rosana Maria Tristão
Full Text Available A fala humana é um som de grande complexidade, cujo processamento perceptual, produção e relações com a linguagem e a cognição necessitam de uma análise integrada, tanto do ponto de vista do conhecimento disponível como também das especificidades metodológicas. Neste artigo faz-se uma breve revisão da literatura sobre as principais aquisições e desenvolvimento da linguagem no primeiro ano de vida de bebês com desenvolvimento normal com enfoque na percepção da fala humana. Busca-se, também, analisar a ocorrência de distúrbios auditivos que podem causar alterações na percepção da fala, com possíveis implicações para o desenvolvimento pré-lingüístico. Atenção especial é dada ao desenvolvimento da habilidade de percepção de fala e de linguagem em bebês com síndrome de Down. É analisada a predisposição, nesta população, a problemas audiológicos, sua relação com alterações no desenvolvimento de linguagem, e a tendência apresentada no primeiro ano de vida para padrões diferenciados de atenção à fala.Human speech is a highly complex sound; whose perceptual processing, production and relations to language and cognition require an integrated analysis, not only from the viewpoint of available knowledge but also of its methodological specificities. This article presents a brief review of the literature on the main acquisitions and development of language in the first year of life of normally developing infants, with emphasis on speech perception. One also analyzes the occurrence of auditory disturbances in the first year of life that could jeopardize speech perception, with possible implications for pre-linguistic development. Special attention is give to the development of speech perception and language in Down syndrome infants. The predisposition to audiologic problems, its relation to impairment in the development of language, and the tendency presented in the first year of life of differential patterns
Wagner, M.; Tran, D.; Togneri, R.; Rose, P.; Powers, D.M.; Onslow, M.; Loakes, D.E.; Lewis, T.W.; Kuratate, T.; Kinoshita, Y.; Kemp, N.; Ishihara, S.; Ingram, J.C.; Hajek, J.T.; Grayden, D.B.; Goecke, R.; Fletcher, J.M.; Estival, D.; Epps, J.R.; Dale, R.; Cutler, A.; Cox, F.M.; Chetty, G.; Cassidy, S.; Butcher, A.R.; Burnham, D.; Bird, S.; Best, C.T.; Bennamoun, M.; Arciuli, J.; Ambikairajah, E.
Under an ARC Linkage Infrastructure, Equipment and Facilities (LIEF) grant, speech science and technology experts from across Australia have joined forces to organise the recording of audio-visual (AV) speech data from representative speakers of Australian English in all capital cities and some regi
Williams, Justin H. G.; Massaro, Dominic W.; Peel, Natalie J.; Bosseler, Alexis; Suddendorf, Thomas
Children with autistic spectrum disorder (ASD) may have poor audio-visual integration, possibly reflecting dysfunctional "mirror neuron" systems which have been hypothesised to be at the core of the condition. In the present study, a computer program, utilizing speech synthesizer software and a "virtual" head (Baldi), delivered speech stimuli for…
Locsei, Gusztav; Pedersen, Julie Hefting; Laugesen, Søren;
This study investigated the role of temporal fine structure (TFS) coding in spatially complex, lateralized listening tasks. Speech reception thresholds (SRTs) were measured in young normal-hearing (NH) and two groups of elderly hearing-impaired (HI) listeners in the presence of speech-shaped noise...
Terfa T. Alakali
Full Text Available This paper examined the phenomenon of hate speech and foul language on social media platforms in Nigeria, and assessed their moral and legal consequences in the society and to journalism practice. It used both quantitative and qualitative methodology to investigate the phenomenon. In the first place, the paper employed the survey research methodology to sample 384 respondents using questionnaire and focus group discussion as instruments for data collection. Findings from the research indicate that promoting hate speech and foul language on social media have moral and legal consequences in the society and to journalism practice. Findings also show that although, the respondents understand that hate speech and foul language attract legal consequences, they do not know what obligations are created by law against perpetrators of hate speech and foul language in Nigeria. The paper therefore, adopted the qualitative, doctrinal and analytical methodology to discuss the legal consequences and obligations created against perpetrators of hate speech and foul language in Nigeria. The paper concluded based on the findings that hate speech and foul language is prevalent on social media platforms in Nigeria and that there are adequate legal provisions to curb the phenomenon in Nigeria. It recommends among others things that the Nigerian government and NGOs should sponsor monitoring projects like the UMATI in Kenya to better understand the use of hate speech and that monitoring agencies set up under the legal regime should adopt mechanisms to identify and remove hate speech content on social media platforms in Nigeria.
Rudnev, V A; Shteĭnerdt, V V
The method of "speech donorship" is based on genetically mediated factors of tempo-rhythmic concordance of speech in monozygotic twins (co-twins) and pairs of close relatives (father-son, mother-daughter, sibs). Recording of the natural audiovisual donor sample of the speech adapted for a structurally-linguistic condition of speech of the recipient was carried out on a digital movie camera. This sample is defined using the data of computer transformation obtained by the program specially developed by the authors. The program allows to compute time equivalents of three parameters: the time spent for realization of "word", "pause", "word + pauses". Work of the recipient with the screen donor sample assumes a support of the restoration of genetic and adaptive speech patterns. Then the recipient works with the own audiovisual sample. The dictionary of a family speech was used to build tests. The use of this method was described for 15 patients with aphasia of vascular and traumatic etiology.
Li, Qi; Yu, Hongtao; Wu, Yan; Gao, Ning
The integration of multiple sensory inputs is essential for perception of the external world. The spatial factor is a fundamental property of multisensory audiovisual integration. Previous studies of the spatial constraints on bimodal audiovisual integration have mainly focused on the spatial congruity of audiovisual information. However, the effect of spatial reliability within audiovisual information on bimodal audiovisual integration remains unclear. In this study, we used event-related potentials (ERPs) to examine the effect of spatial reliability of task-irrelevant sounds on audiovisual integration. Three relevant ERP components emerged: the first at 140-200ms over a wide central area, the second at 280-320ms over the fronto-central area, and a third at 380-440ms over the parieto-occipital area. Our results demonstrate that ERP amplitudes elicited by audiovisual stimuli with reliable spatial relationships are larger than those elicited by stimuli with inconsistent spatial relationships. In addition, we hypothesized that spatial reliability within an audiovisual stimulus enhances feedback projections to the primary visual cortex from multisensory integration regions. Overall, our findings suggest that the spatial linking of visual and auditory information depends on spatial reliability within an audiovisual stimulus and occurs at a relatively late stage of processing.
Full Text Available Emotions are central to our perception of the environment surrounding us (Berlyne, 1971. An important aspect in the emotional response to a sound is dependent on the meaning of the sound, ie, it is not the physical parameter per se that determines our emotional response to the sound but rather the source of the sound (Genell, 2008, and the relevance it has to the self (Tajadura-Jiménez et al 2010. When exposed to sound together with visual information, the information from both modalities is integrated, altering the perception of each modality, in order to generate a coherent experience. In emotional information this integration is rapid and without requirements of attentional processes (De Gelder, 1999. The present experiment investigates perception of pink noise in two visual settings in a within-subjects design. Nineteen participants rated the same sound twice in terms of pleasantness and arousal in either a pleasant or an unpleasant visual setting. The results showed that pleasantness of the sound decreased in the negative visual setting, thus suggesting an audio-visual integration, where the affective information in the visual modality is translated to the auditory modality when information-markers are lacking in it. The results are discussed in relation to theories of emotion perception.
Concepto de documento audiovisual y de documentación audiovisual, profundizando en la distinción de documentación de imagen en movimiento con posible incorporación de sonido frente al concepto de documentación audiovisual según plantea Jorge Caldera. Diferenciación entre documentos audiovisuales, obras audiovisuales y patrimonio audiovisual según Félix del Valle.
The relationship between speech perception and production has been debated for a long time. The Motor Theory of speech perception (Liberman et al., 1989) claims that perceiving speech is identifying the intended articulatory gestures rather than perceiving the sound patterns. It seems to suggest that speech production precedes speech perception,…
The Chinese audiovisual market is to impose a ban on audiovisual product dealers whose licenses have been revoked for violatingthe law. This ban will prohibit them from dealing in audiovisual products for ten years. Their names are to be included on a blacklist made known to the public.
A history professor relates his experiences producing and using audio-visual material and warns teachers not to rely on audio-visual aids for classroom presentations. Includes examples of popular audio-visual aids on Canada that communicate unintended, inaccurate, or unclear ideas. Urges teachers to exercise caution in the selection and use of…
Morand, J J
The author presents a list of the audio-visual productions about Tropical Medicine, as well as of their main characteristics. He thinks that the audio-visual educational productions are often dissociated from their promotion; therefore, he invites the future creator to forward his work to the Audio-Visual Health Committee.
Law, Sam-Po; Fung, Roxana; Kung, Carmen
This study investigated a theoretically challenging dissociation between good production and poor perception of tones among neurologically unimpaired native speakers of Cantonese. The dissociation is referred to as the near-merger phenomenon in sociolinguistic studies of sound change. In a passive oddball paradigm, lexical and nonlexical syllables of the T1/T6 and T4/T6 contrasts were presented to elicit the mismatch negativity (MMN) and P3a from two groups of participants, those who could produce and distinguish all tones in the language (Control) and those who could produce all tones but specifically failed to distinguish between T4 and T6 in perception (Dissociation). The presence of MMN to T1/T6 and null response to T4/T6 of lexical syllables in the dissociation group confirmed the near-merger phenomenon. The observation that the control participants exhibited a statistically reliable MMN to lexical syllables of T1/T6, weaker responses to nonlexical syllables of T1/T6 and lexical syllables of T4/T6, and finally null response to nonlexical syllables of T4/T6, suggests the involvement of top-down processing in speech perception. Furthermore, the stronger P3a response of the control group, compared with the dissociation group in the same experimental conditions, may be taken to indicate higher cognitive capability in attention switching, auditory attention or memory in the control participants. This cognitive difference, together with our speculation that constant top-down predictions without complete bottom-up analysis of acoustic signals in speech recognition may reduce one’s sensitivity to small acoustic contrasts, account for the occurrence of dissociation in some individuals but not others. PMID:23342146
Law, Sam-Po; Fung, Roxana; Kung, Carmen
This study investigated a theoretically challenging dissociation between good production and poor perception of tones among neurologically unimpaired native speakers of Cantonese. The dissociation is referred to as the near-merger phenomenon in sociolinguistic studies of sound change. In a passive oddball paradigm, lexical and nonlexical syllables of the T1/T6 and T4/T6 contrasts were presented to elicit the mismatch negativity (MMN) and P3a from two groups of participants, those who could produce and distinguish all tones in the language (Control) and those who could produce all tones but specifically failed to distinguish between T4 and T6 in perception (Dissociation). The presence of MMN to T1/T6 and null response to T4/T6 of lexical syllables in the dissociation group confirmed the near-merger phenomenon. The observation that the control participants exhibited a statistically reliable MMN to lexical syllables of T1/T6, weaker responses to nonlexical syllables of T1/T6 and lexical syllables of T4/T6, and finally null response to nonlexical syllables of T4/T6, suggests the involvement of top-down processing in speech perception. Furthermore, the stronger P3a response of the control group, compared with the dissociation group in the same experimental conditions, may be taken to indicate higher cognitive capability in attention switching, auditory attention or memory in the control participants. This cognitive difference, together with our speculation that constant top-down predictions without complete bottom-up analysis of acoustic signals in speech recognition may reduce one's sensitivity to small acoustic contrasts, account for the occurrence of dissociation in some individuals but not others.
Full Text Available Visual allocentric references frames (contextual cues affect visual space perception (Diedrichsen et al., 2004; Walter et al., 2006. On the other hand, experiments have shown a change of visual perception induced by binaural stimuli (Chandler, 1961; Carlile et al., 2001. In the present study we investigate the effect of visual and audiovisual allocentred reference frame on visual localization and straight ahead pointing. Participant faced a black part-spherical screen (92cm radius. The head was maintained aligned with the body. Participant wore headphone and a glove with motion capture markers. A red laser point was displayed straight ahead as fixation point. The visual target was a 100ms green laser point. After a short delay, the green laser reappeared and participant had to localize target with a trackball. Straight ahead blind pointing was required before and after series of 48 trials. Visual part of the bimodal allocentred reference frame was provided by a vertical red laser line (15° left or 15° right, auditory part was provided by 3D sound. Five conditions were tested, no-reference, visual reference (left/right, audiovisual reference (left/right. Results show that the significant effect of bimodal audiovisual reference is not different from the visual reference one.
Locsei, Gusztav; Pedersen, Julie Hefting; Laugesen, Søren;
hearing loss above 1.5 kHz participated in the study. Speech reception thresholds (SRTs) were estimated in the presence of either speech-shaped noise, two-, four-, or eight-talker babble played reversed, or a nonreversed two-talker masker. Target audibility was ensured by applying individualized linear...... understanding in spatially complex environments, these limitations were unrelated to TFS coding abilities and were only weakly associated with a reduction in binaural-unmasking benefit for spatially separated competing sources....
This research aimed to improve the skill in appreciating dances owned by the students of Primary Teacher Education of Makassar State University, to improve the perception towards audio-visual based art appreciation, to increase the students’ interest in audio-visual based art education subject, and to increase the students’ responses to the subject. This research was classroom action research using the research design created by Kemmis & MC. Taggart, which was conducted to 42 students of Prim...
Documentary makers, journalists, news editors, and other media professionals routinely require previously recorded audiovisual material for new productions. For example, a news editor might wish to reuse footage from overseas services for the evening news, or a documentary maker describing the histo
Most, Tova; Harel, Tamar; Shpak, Talma; Luntz, Michal
Purpose: The purpose of the study was to evaluate the contribution of acoustic hearing to the perception of suprasegmental features by adults who use a cochlear implant (CI) and a hearing aid (HA) in opposite ears. Method: 23 adults participated in this study. Perception of suprasegmental features--intonation, syllable stress, and word…
Hakvoort, Britt; van der Leij, Aryan; van Setten, Ellie; Maurits, Natasha; Maassen, Ben; van Zuijen, Titia
Atypical language lateralization has been marked as one of the factors that may contribute to the development of dyslexia. Indeed, atypical lateralization of linguistic functions such as speech processing in dyslexia has been demonstrated using neuroimaging studies, but also using the behavioral dic
Wilson, Leanne; McNeill, Brigid; Gillon, Gail T.
Successful collaboration among speech and language therapists (SLTs) and teachers fosters the creation of communication friendly classrooms that maximize children's spoken and written language learning. However, these groups of professionals may have insufficient opportunity in their professional study to develop the shared knowledge, perceptions…
Full Text Available Speech perception for both hearing and deaf people involves an integrative process between auditory and lip-reading information. In order to disambiguate information from lips, manual cues from Cued Speech may be added. Cued Speech (CS is a system of manual aids developed to help deaf people to clearly and completely understand speech visually (Cornett, 1967. Within this system, both labial and manual information, as lone input sources, remain ambiguous. Perceivers, therefore, have to combine both types of information in order to get one coherent percept. In this study, we examined how audio-visual (AV integration is affected by the presence of manual cues and on which form of information (auditory, labial or manual the CS receptors primarily rely. To address this issue, we designed a unique experiment that implemented the use of AV McGurk stimuli (audio /pa/ and lip-reading /ka/ which were produced with or without manual cues. The manual cue was congruent with either auditory information, lip information or the expected fusion. Participants were asked to repeat the perceived syllable aloud. Their responses were then classified into four categories: audio (when the response was /pa/, lip-reading (when the response was /ka/, fusion (when the response was /ta/ and other (when the response was something other than /pa/, /ka/ or /ta/. Data were collected from hearing impaired individuals who were experts in CS (all of which had either cochlear implants or binaural hearing aids; N=8, hearing-individuals who were experts in CS (N = 14 and hearing-individuals who were completely naïve of CS (N = 15. Results confirmed that, like hearing-people, deaf people can merge auditory and lip-reading information into a single unified percept. Without manual cues, McGurk stimuli induced the same percentage of fusion responses in both groups. Results also suggest that manual cues can modify the AV integration and that their impact differs between hearing and deaf
Bayard, Clémence; Colin, Cécile; Leybaert, Jacqueline
Speech perception for both hearing and deaf people involves an integrative process between auditory and lip-reading information. In order to disambiguate information from lips, manual cues from Cued Speech may be added. Cued Speech (CS) is a system of manual aids developed to help deaf people to clearly and completely understand speech visually (Cornett, 1967). Within this system, both labial and manual information, as lone input sources, remain ambiguous. Perceivers, therefore, have to combine both types of information in order to get one coherent percept. In this study, we examined how audio-visual (AV) integration is affected by the presence of manual cues and on which form of information (auditory, labial or manual) the CS receptors primarily rely. To address this issue, we designed a unique experiment that implemented the use of AV McGurk stimuli (audio /pa/ and lip-reading /ka/) which were produced with or without manual cues. The manual cue was congruent with either auditory information, lip information or the expected fusion. Participants were asked to repeat the perceived syllable aloud. Their responses were then classified into four categories: audio (when the response was /pa/), lip-reading (when the response was /ka/), fusion (when the response was /ta/) and other (when the response was something other than /pa/, /ka/ or /ta/). Data were collected from hearing impaired individuals who were experts in CS (all of which had either cochlear implants or binaural hearing aids; N = 8), hearing-individuals who were experts in CS (N = 14) and hearing-individuals who were completely naïve of CS (N = 15). Results confirmed that, like hearing-people, deaf people can merge auditory and lip-reading information into a single unified percept. Without manual cues, McGurk stimuli induced the same percentage of fusion responses in both groups. Results also suggest that manual cues can modify the AV integration and that their impact differs between hearing and deaf people.
Yuan, Xiangyong; Bi, Cuihua; Huang, Xiting
Out-of-synchrony experiences can easily recalibrate one's subjective simultaneity point in the direction of the experienced asynchrony. Although temporal adjustment of multiple audiovisual stimuli has been recently demonstrated to be spatially specific, perceptual grouping processes that organize separate audiovisual stimuli into distinctive "objects" may play a more important role in forming the basis for subsequent multiple temporal recalibrations. We investigated whether apparent physical differences between audiovisual pairs that make them distinct from each other can independently drive multiple concurrent temporal recalibrations regardless of spatial overlap. Experiment 1 verified that reducing the physical difference between two audiovisual pairs diminishes the multiple temporal recalibrations by exposing observers to two utterances with opposing temporal relationships spoken by one single speaker rather than two distinct speakers at the same location. Experiment 2 found that increasing the physical difference between two stimuli pairs can promote multiple temporal recalibrations by complicating their non-temporal dimensions (e.g., disks composed of two rather than one attribute and tones generated by multiplying two frequencies); however, these recalibration aftereffects were subtle. Experiment 3 further revealed that making the two audiovisual pairs differ in temporal structures (one transient and one gradual) was sufficient to drive concurrent temporal recalibration. These results confirm that the more audiovisual pairs physically differ, especially in temporal profile, the more likely multiple temporal perception adjustments will be content-constrained regardless of spatial overlap. These results indicate that multiple temporal recalibrations are based secondarily on the outcome of perceptual grouping processes.
Full Text Available Se presenta el desarrollo de un sistema automático de reconocimiento audiovisual del habla enfocado en el reconocimiento de comandos. La representación del audio se realizó mediante los coeficientes cepstrales de Mel y las primeras dos derivadas temporales. Para la caracterización del vídeo se hizo seguimiento automático de características visuales de alto nivel a través de toda la secuencia. Para la inicialización automática del algoritmo se emplearon transformaciones de color y contornos activos con información de flujo del vector gradiente ("GVF snakes" sobre la región labial, mientras que para el seguimiento se usaron medidas de similitud entre vecindarios y restricciones morfológicas definidas en el estándar MPEG-4. Inicialmente, se presenta el diseño del sistema de reconocimiento automático del habla, empleando únicamente información de audio (ASR, mediante Modelos Ocultos de Markov (HMMs y un enfoque de palabra aislada; posteriormente, se muestra el diseño de los sistemas empleando únicamente características de vídeo (VSR, y empleando características de audio y vídeo combinadas (AVSR. Al final se comparan los resultados de los tres sistemas para una base de datos propia en español y francés, y se muestra la influencia del ruido acústico, mostrando que el sistema de AVSR es más robusto que ASR y VSR.We present the development of an automatic audiovisual speech recognition system focused on the recognition of commands. Signal audio representation was done using Mel cepstral coefficients and their first and second order time derivatives. In order to characterize the video signal, a set of high-level visual features was tracked throughout the sequences. Automatic initialization of the algorithm was performed using color transformations and active contour models based on Gradient Vector Flow (GVF Snakes on the lip region, whereas visual tracking used similarity measures across neighborhoods and morphological
section includes four carefully selected chapters. They deal with facets of speech production, speech acoustics, and/or speech perception or recognition, place them in an integrated phonetic-phonological perspective, and relate them in more or less explicit ways to aspects of speech technology. Therefore......, we hope that this volume can help speech scientists with traditional training in phonetics and phonology to keep up with the latest developments in speech technology. In the opposite direction, speech researchers starting from a technological perspective will hopefully get inspired by reading about...... the questions, phenomena, and communicative functions that are currently addressed in phonetics and phonology. Either way, the future of speech research lies in international, interdisciplinary collaborations, and our volume is meant to reflect and facilitate such collaborations...
Cecere, Roberto; Gross, Joachim; Thut, Gregor
The ability to integrate auditory and visual information is critical for effective perception and interaction with the environment, and is thought to be abnormal in some clinical populations. Several studies have investigated the time window over which audiovisual events are integrated, also called the temporal binding window, and revealed asymmetries depending on the order of audiovisual input (i.e. the leading sense). When judging audiovisual simultaneity, the binding window appears narrower and non-malleable for auditory-leading stimulus pairs and wider and trainable for visual-leading pairs. Here we specifically examined the level of independence of binding mechanisms when auditory-before-visual vs. visual-before-auditory input is bound. Three groups of healthy participants practiced audiovisual simultaneity detection with feedback, selectively training on auditory-leading stimulus pairs (group 1), visual-leading stimulus pairs (group 2) or both (group 3). Subsequently, we tested for learning transfer (crossover) from trained stimulus pairs to non-trained pairs with opposite audiovisual input. Our data confirmed the known asymmetry in size and trainability for auditory-visual vs. visual-auditory binding windows. More importantly, practicing one type of audiovisual integration (e.g. auditory-visual) did not affect the other type (e.g. visual-auditory), even if trainable by within-condition practice. Together, these results provide crucial evidence that audiovisual temporal binding for auditory-leading vs. visual-leading stimulus pairs are independent, possibly tapping into different circuits for audiovisual integration due to engagement of different multisensory sampling mechanisms depending on leading sense. Our results have implications for informing the study of multisensory interactions in healthy participants and clinical populations with dysfunctional multisensory integration.
Arrighi, Roberto; Marini, Francesco; Burr, David
Robust perception requires efficient integration of information from our various senses. Much recent electrophysiology points to neural areas responsive to multisensory stimulation, particularly audiovisual stimulation. However, psychophysical evidence for functional integration of audiovisual motion has been ambiguous. In this study we measure perception of an audiovisual form of biological motion, tap dancing. The results show that the audio tap information interacts with visual motion information, but only when in synchrony, demonstrating a functional combination of audiovisual information in a natural task. The advantage of multimodal combination was better than the optimal maximum likelihood prediction.
Full Text Available In contrast with for example audiovisual speech, the relation between visual and auditory properties of letters and speech sounds is artificial and learned only by explicit instruction. The arbitrariness of the audiovisual link together with the widespread usage of letter – speech sound pairs in alphabetic languages makes those audiovisual objects a unique subject for cross-modal research. Brain imaging evidence has indicated that heteromodal areas in superior temporal, as well as modality-specific auditory cortex are involved in letter – speech sound processing. The role of low level visual areas, however, remains unclear. In this study the visual counterpart of the auditory mismatch negativity (MMN is used to investigate the influences of speech sounds on letter processing. Letter and non-letter deviants were infrequently presented in a train of standard letters, either in isolation or simultaneously with speech sounds. Although previous findings showed that letters systematically modulate speech sound processing (reflected by auditory MMN amplitude modulation, the reverse does not seem to hold: our results did not show evidence for an automatic influence of speech sounds on letter processing (no visual MMN amplitude modulation. This apparent asymmetric recruitment of low level sensory cortices during letter - speech sound processing, contrasts with the symmetric involvement of these cortices in audiovisual speech processing, and is possibly due to the arbitrary nature of the link between letters and speech sounds.
Froyen, Dries; van Atteveldt, Nienke; Blomert, Leo
In contrast with for example audiovisual speech, the relation between visual and auditory properties of letters and speech sounds is artificial and learned only by explicit instruction. The arbitrariness of the audiovisual link together with the widespread usage of letter–speech sound pairs in alphabetic languages makes those audiovisual objects a unique subject for crossmodal research. Brain imaging evidence has indicated that heteromodal areas in superior temporal, as well as modality-specific auditory cortex are involved in letter–speech sound processing. The role of low level visual areas, however, remains unclear. In this study the visual counterpart of the auditory mismatch negativity (MMN) is used to investigate the influences of speech sounds on letter processing. Letter and non-letter deviants were infrequently presented in a train of standard letters, either in isolation or simultaneously with speech sounds. Although previous findings showed that letters systematically modulate speech sound processing (reflected by auditory MMN amplitude modulation), the reverse does not seem to hold: our results did not show evidence for an automatic influence of speech sounds on letter processing (no visual MMN amplitude modulation). This apparent asymmetric recruitment of low level sensory cortices during letter–speech sound processing, contrasts with the symmetric involvement of these cortices in audiovisual speech processing, and is possibly due to the arbitrary nature of the link between letters and speech sounds. PMID:20428501
Froyen, Dries; van Atteveldt, Nienke; Blomert, Leo
In contrast with for example audiovisual speech, the relation between visual and auditory properties of letters and speech sounds is artificial and learned only by explicit instruction. The arbitrariness of the audiovisual link together with the widespread usage of letter-speech sound pairs in alphabetic languages makes those audiovisual objects a unique subject for crossmodal research. Brain imaging evidence has indicated that heteromodal areas in superior temporal, as well as modality-specific auditory cortex are involved in letter-speech sound processing. The role of low level visual areas, however, remains unclear. In this study the visual counterpart of the auditory mismatch negativity (MMN) is used to investigate the influences of speech sounds on letter processing. Letter and non-letter deviants were infrequently presented in a train of standard letters, either in isolation or simultaneously with speech sounds. Although previous findings showed that letters systematically modulate speech sound processing (reflected by auditory MMN amplitude modulation), the reverse does not seem to hold: our results did not show evidence for an automatic influence of speech sounds on letter processing (no visual MMN amplitude modulation). This apparent asymmetric recruitment of low level sensory cortices during letter-speech sound processing, contrasts with the symmetric involvement of these cortices in audiovisual speech processing, and is possibly due to the arbitrary nature of the link between letters and speech sounds.
Kovyazina, M.S.; Khokhlov, N. A.; Morozova, N. V.
This article discusses the connection of hemispheric control over audioverbal perception processes and such individual features as “leading hand” (right-handedness and left-handedness). We present a literature review and description of our research to provide evidence of the complexity and ambiguity of this connection. The method of dichotic listening was used for diagnosing audioverbal perception lateralization. This method allows estimation of the right-ear coefficient (REC), the efficiency...
Hausen, Maija; Torppa, Ritva; Salmela, Viljami R; Vainio, Martti; Särkämö, Teppo
Disorders of music and speech perception, known as amusia and aphasia, have traditionally been regarded as dissociated deficits based on studies of brain damaged patients. This has been taken as evidence that music and speech are perceived by largely separate and independent networks in the brain. However, recent studies of congenital amusia have broadened this view by showing that the deficit is associated with problems in perceiving speech prosody, especially intonation and emotional prosody. In the present study the association between the perception of music and speech prosody was investigated with healthy Finnish adults (n = 61) using an on-line music perception test including the Scale subtest of Montreal Battery of Evaluation of Amusia (MBEA) and Off-Beat and Out-of-key tasks as well as a prosodic verbal task that measures the perception of word stress. Regression analyses showed that there was a clear association between prosody perception and music perception, especially in the domain of rhythm perception. This association was evident after controlling for music education, age, pitch perception, visuospatial perception, and working memory. Pitch perception was significantly associated with music perception but not with prosody perception. The association between music perception and visuospatial perception (measured using analogous tasks) was less clear. Overall, the pattern of results indicates that there is a robust link between music and speech perception and that this link can be mediated by rhythmic cues (time and stress).
Full Text Available Disorders of music and speech perception, known as amusia and aphasia, have traditionally been regarded as dissociated deficits based on studies of brain damaged patients. This has been taken as evidence that music and speech are perceived by largely separate and independent networks in the brain. However, recent studies of congenital amusia have broadened this view by showing that the deficit is associated with problems in perceiving speech prosody, especially intonation and emotional prosody. In the present study the association between the perception of music and speech prosody was investigated with healthy Finnish adults (n = 61 using an on-line music perception test including the Scale subtest of Montreal Battery of Evaluation of Amusia (MBEA and Off-Beat and Out-of-key tasks as well as a prosodic verbal task that measures the perception of word stress. Regression analyses showed that there was a clear association between prosody perception and music perception, especially in the domain of rhythm perception. This association was evident after controlling for music education, age, pitch perception, visuospatial perception and working memory. Pitch perception was significantly associated with music perception but not with prosody perception. The association between music perception and visuospatial perception (measured using analogous tasks was less clear. Overall, the pattern of results indicates that there is a robust link between music and speech perception and that this link can be mediated by rhythmic cues (time and stress.
Audiovisual quality assessment is one of the major challenges in multimedia communications. Traditionally, algorithm-based (objective) assessment methods have focused primarily on the compression artifacts. However, compression is only one of the numerous factors influencing the perception....... In communications applications, transmission errors, including packet losses and bit errors, can be a significant source of quality degradation. Also the environmental factors, such as background noise, ambient light and display characteristics, pose an impact on perception. A third aspect that has not been widely...
The aim of this study was to investigate neural dynamics of audiovisual temporal fusion processes in 6-month-old infants using event-related brain potentials (ERPs). In a habituation-test paradigm, infants did not show any behavioral signs of discrimination of an audiovisual asynchrony of 200 ms, indicating perceptual fusion. In a subsequent EEG experiment, audiovisual synchronous stimuli and stimuli with a visual delay of 200 ms were presented in random order. In contrast to the behavioral data, brain activity differed significantly between the two conditions. Critically, N1 and P2 latency delays were not observed between synchronous and fused items, contrary to previously observed N1 and P2 latency delays between synchrony and perceived asynchrony. Hence, temporal interaction processes in the infant brain between the two sensory modalities varied as a function of perceptual fusion versus asynchrony perception. The visual recognition components Pb and Nc were modulated prior to sound onset, emphasizing the importance of anticipatory visual events for the prediction of auditory signals. Results suggest mechanisms by which young infants predictively adjust their ongoing neural activity to the temporal synchrony relations to be expected between vision and audition.
Full Text Available Audiovisual perception of emotions has been typically examined using displays of a solitary character (e.g. the face-voice and/or body-sound of one actor. However, in real life humans often face more complex multisensory social situations, involving more than one person. Here we ask if the audiovisual facilitation in emotion recognition previously found in simpler social situations extends to more complex and ecological situations. Stimuli consisting of the biological motion and voice of two interacting agents were used in two experiments. In Experiment 1, participants were presented with visual, auditory, auditory filtered/noisy, and audiovisual congruent and incongruent clips. We asked participants to judge whether the two agents were interacting happily or angrily. In Experiment 2, another group of participants repeated the same task, as in Experiment 1, while trying to ignore either the visual or the auditory information. The findings from both experiments indicate that when the reliability of the auditory cue was decreased participants weighted more the visual cue in their emotional judgments. This in turn translated in increased emotion recognition accuracy for the multisensory condition. Our findings thus point to a common mechanism of multisensory integration of emotional signals irrespective of social stimulus complexity.
Rachel N Denison
Full Text Available Synchrony between events in different senses has long been considered the critical temporal cue for multisensory integration. Here, using rapid streams of auditory and visual events, we demonstrate how humans can use temporal structure (rather than mere temporal coincidence to detect multisensory relatedness. We find psychophysically that participants can detect matching auditory and visual streams via shared temporal structure for crossmodal lags of up to 200 ms. Performance on this task reproduced features of past findings based on explicit timing judgments but did not show any special advantage for perfectly synchronous streams. Importantly, the complexity of temporal patterns influences sensitivity to correspondence. Stochastic, irregular streams – with richer temporal pattern information – led to higher audio-visual matching sensitivity than predictable, rhythmic streams. Our results reveal that temporal structure and its complexity are key determinants for human detection of audio-visual correspondence. The distinctive emphasis of our new paradigms on temporal patterning could be useful for studying special populations with suspected abnormalities in audio-visual temporal perception and multisensory integration.
Pitch is a psychoacoustic construct crucial in the production and perception of speech and songs. This article is an exploration of the interface of speech and song performance of Chinese speakers. Although parallels might be drawn from the prosodic and sound structures of the linguistic and musical systems, perceiving and producing speech and…
... of your treatment plan may include seeing a speech therapist , a person who is trained to treat speech disorders. How often you have to see the speech therapist will vary — you'll probably start out seeing ...
Carrell, Thomas D.
This study investigated the contributions of fundamental frequency, formant spacing, and glottal waveform to talker identification. The first two experiments focused on the effect of glottal waveform in the perception of talker identity. Subjects in the first experiment, 30 undergraduate students enrolled in an introductory psychology course,…
Archila-Suerte, Pilar; Zevin, Jason; Bunta, Ferenc; Hernandez, Arturo E.
Sensorimotor processing in children and higher-cognitive processing in adults could determine how non-native phonemes are acquired. This study investigates how age-of-acquisition (AOA) and proficiency-level (PL) predict native-like perception of statistically dissociated L2 categories, i.e., within-category and between-category. In a similarity…
Full Text Available We investigated the role of the syllable during speech processing in German, in an auditory-auditory fragment priming study with lexical decision and simultaneous EEG registration. Spoken fragment primes either shared segments (related with the spoken targets or not (unrelated, and this segmental overlap either corresponded to the first syllable of the target (e.g., /teis/ - /teisti/, or not (e.g., /teis/ - /teistləs/. Similar prime conditions applied for word and pseudoword targets. Lexical decision latencies revealed facilitation due to related fragments that corresponded to the first syllable of the target (/teis/ - /teisti/ in some but not all (/teist/ - /teistləs/ conditions. Despite segmental overlap, there were no positive effects for related fragments that mismatched the first syllable. No facilitation was observed for pseudowords. The EEG analyses showed a consistent effect of relatedness, independent of syllabic match, from 200 – 500 ms, including the P350 and N400 windows. Moreover, this held for words and pseudowords alike. The only specific effect of syllabic match for related prime - target pairs was observed in the time window from 200 – 300 ms. We discuss the nature and potential origin of these effects, and their relevance for speech processing and lexical access.
Douglas M Shiller
Full Text Available Auditory input is essential for normal speech development and plays a key role in speech production throughout the life span. In traditional models, auditory input plays two critical roles: 1 establishing the acoustic correlates of speech sounds that serve, in part, as the targets of speech production, and 2 as a source of feedback about a talker's own speech outcomes. This talk will focus on both of these roles, describing a series of studies that examine the capacity of children and adults to adapt to real-time manipulations of auditory feedback during speech production. In one study, we examined sensory and motor adaptation to a manipulation of auditory feedback during production of the fricative “s”. In contrast to prior accounts, adaptive changes were observed not only in speech motor output but also in subjects' perception of the sound. In a second study, speech adaptation was examined following a period of auditory–perceptual training targeting the perception of vowels. The perceptual training was found to systematically improve subjects' motor adaptation response to altered auditory feedback during speech production. The results of both studies support the idea that perceptual and motor processes are tightly coupled in speech production learning, and that the degree and nature of this coupling may change with development.
Full Text Available According to a traditional view, speech perception and production are processed largely separately in sensory and motor brain areas. Recent psycholinguistic and neuroimaging studies provide novel evidence that the sensory and motor systems dynamically interact in speech processing, by demonstrating that speech perception and imitation share regional brain activations. However, the exact nature and mechanisms of these sensorimotor interactions are not completely understood yet.Transcranial magnetic stimulation (TMS has often been used in the cognitive neurosciences, including speech research, as a complementary technique to behavioral and neuroimaging studies. Here we provide an up-to-date review focusing on TMS studies that explored speech perception and imitation.Single-pulse TMS of the primary motor cortex (M1 demonstrated a speech specific and somatotopically specific increase of excitability of the M1 lip area during speech perception (listening to speech or lip reading. A paired-coil TMS approach showed increases in effective connectivity from brain regions that are involved in speech processing to the M1 lip area when listening to speech. TMS in virtual lesion mode applied to speech processing areas modulated performance of phonological recognition and imitation of perceived speech.In summary, TMS is an innovative tool to investigate processing of speech perception and imitation. TMS studies have provided strong evidence that the sensory system is critically involved in mapping sensory input onto motor output and that the motor system plays an important role in speech perception.
Murakami, Takenobu; Ugawa, Yoshikazu; Ziemann, Ulf
According to a traditional view, speech perception and production are processed largely separately in sensory and motor brain areas. Recent psycholinguistic and neuroimaging studies provide novel evidence that the sensory and motor systems dynamically interact in speech processing, by demonstrating that speech perception and imitation share regional brain activations. However, the exact nature and mechanisms of these sensorimotor interactions are not completely understood yet. Transcranial magnetic stimulation (TMS) has often been used in the cognitive neurosciences, including speech research, as a complementary technique to behavioral and neuroimaging studies. Here we provide an up-to-date review focusing on TMS studies that explored speech perception and imitation. Single-pulse TMS of the primary motor cortex (M1) demonstrated a speech specific and somatotopically specific increase of excitability of the M1 lip area during speech perception (listening to speech or lip reading). A paired-coil TMS approach showed increases in effective connectivity from brain regions that are involved in speech processing to the M1 lip area when listening to speech. TMS in virtual lesion mode applied to speech processing areas modulated performance of phonological recognition and imitation of perceived speech. In summary, TMS is an innovative tool to investigate processing of speech perception and imitation. TMS studies have provided strong evidence that the sensory system is critically involved in mapping sensory input onto motor output and that the motor system plays an important role in speech perception.
Schierholz, Irina; Finke, Mareike; Kral, Andrej; Büchner, Andreas; Rach, Stefan; Lenarz, Thomas; Dengler, Reinhard; Sandmann, Pascale
There is substantial variability in speech recognition ability across patients with cochlear implants (CIs), auditory brainstem implants (ABIs), and auditory midbrain implants (AMIs). To better understand how this variability is related to central processing differences, the current electroencephalography (EEG) study compared hearing abilities and auditory-cortex activation in patients with electrical stimulation at different sites of the auditory pathway. Three different groups of patients with auditory implants (Hannover Medical School; ABI: n = 6, CI: n = 6; AMI: n = 2) performed a speeded response task and a speech recognition test with auditory, visual, and audio-visual stimuli. Behavioral performance and cortical processing of auditory and audio-visual stimuli were compared between groups. ABI and AMI patients showed prolonged response times on auditory and audio-visual stimuli compared with NH listeners and CI patients. This was confirmed by prolonged N1 latencies and reduced N1 amplitudes in ABI and AMI patients. However, patients with central auditory implants showed a remarkable gain in performance when visual and auditory input was combined, in both speech and non-speech conditions, which was reflected by a strong visual modulation of auditory-cortex activation in these individuals. In sum, the results suggest that the behavioral improvement for audio-visual conditions in central auditory implant patients is based on enhanced audio-visual interactions in the auditory cortex. Their findings may provide important implications for the optimization of electrical stimulation and rehabilitation strategies in patients with central auditory prostheses. Hum Brain Mapp 38:2206-2225, 2017. © 2017 Wiley Periodicals, Inc.
Wiersinga-Post, Esther; Tomaskovic, Sonja; Slabu, Lavinia; Renken, Remco; de Smit, Femke; Duifhuis, Hendrikus
Audiovisual processing was studied in a functional magnetic resonance imaging study using the McGurk effect. Perceptual responses and the brain activity patterns were measured as a function of audiovisual delay. In several cortical and subcortical brain areas, BOLD responses correlated negatively wi
A report on the proceedings and ideas expressed at a one day seminar on "Audio-Visual Equipment--Its Uses and Applications for Teaching and Research in Universities." The seminar was organized by England's National Committee for Audio-Visual Aids in Education in conjunction with the British Universities Film Council. (LS)
Ionescu, Bogdan; Seyerlehner, Klaus; Rasche, Christoph; Vertan, Constantin; Lambert, Patrick
We propose an audio-visual approach to video genre classification using content descriptors that exploit audio, color, temporal, and contour information. Audio information is extracted at block-level, which has the advantage of capturing local temporal information. At the temporal structure level, we consider action content in relation to human perception. Color perception is quantified using statistics of color distribution, elementary hues, color properties, and relationships between colors. Further, we compute statistics of contour geometry and relationships. The main contribution of our work lies in harnessing the descriptive power of the combination of these descriptors in genre classification. Validation was carried out on over 91 h of video footage encompassing 7 common video genres, yielding average precision and recall ratios of 87% to 100% and 77% to 100%, respectively, and an overall average correct classification of up to 97%. Also, experimental comparison as part of the MediaEval 2011 benchmarking campaign demonstrated the efficiency of the proposed audio-visual descriptors over other existing approaches. Finally, we discuss a 3-D video browsing platform that displays movies using feature-based coordinates and thus regroups them according to genre.
Full Text Available Michel Foucault ensina que toda fala sistemática - inclusive aquela que se afirma “neutra” ou “uma desinteressada visão objetiva do que acontece” - é, na verdade, mecanismo de articulação do saber e, na seqüência, de formação de poder. O aparecimento de novas tecnologias, especialmente as digitais, no campo da produção audiovisual, provoca uma avalanche de declarações de cineastas, ensaios de acadêmicos e previsões de demiurgos da mídia.
... 29 Labor 1 2010-07-01 2010-07-01 true Audiovisual coverage prohibited. 2.13 Section 2.13 Labor Office of the Secretary of Labor GENERAL REGULATIONS Audiovisual Coverage of Administrative Hearings § 2.13 Audiovisual coverage prohibited. The Department shall not permit audiovisual coverage of...
Cowan, Gloria; Khatchadourian, Desiree
Women are more intolerant of hate speech than men. This study examined relationality measures as mediators of gender differences in the perception of the harm of hate speech and the importance of freedom of speech. Participants were 107 male and 123 female college students. Questionnaires assessed the perceived harm of hate speech, the importance…
Mandeville, Mary Y.
As student diversity grows, it is important to make the basic speech communication course relevant for the students enrolled. Textbooks for the basic speech communication course provide the basic information on the subject; the responsibility for teaching the basic course in speech communication is often assigned to graduate teaching assistants.…
Full Text Available Severe to profound prelingual deafness that is either congenital or acquired is estimated to occur in 0.5 to 3 per 1000 live births. This is often associated with early delays in language development, speech perception, socialization and results in lower academic achievement. These de velopmental and behavioral problems are severe as 90 % of children are born to normal patients whereas with deaf parents it is less as they have a mutual communication. After much research in this field the first 22 channel cochlear implant surgery was don e in 1982. The number of prelingually deafened adults seeking cochlear implant is increasing as these individuals can derive substantial benefit, although their performance is poorer than adults with post - lingual deafness. MATERIAL AND METHODS : The present prospective study was conducted in the Department of ENT, Pt. J.N.M. Medical College and Dr. B. R.A.M. Hospital, Raipur (C.G. The subject selected were prelingually deafened individuals who were undergoing post cochlear implant speech therapy in the Depar tment. This study included individuals, who underwent cochlear implant surgery in this Department during the period of July, 2008 to September, 2010 and the age was within 10 years at the time of surgery. The study was designed as a prospective longitudina l analysis to asses functioning of patients, who underwent cochlear implantation. A total 37 cochlear implant surgeries were carried out in Department. Of these 3 cases were outside the age criteria of the present study and another 2 cases were lost in fol low up. Pre - operatively, detailed information of subject including the age, sex and address as well as contact number was collected. Then a General Examination was followed with reference to Built, Nutrition, Pulse, and Blood pressure, Oedema, Cyanosis, Cl ubbing and Citrus. A systemic examination was also performed. A Local Examination with special emphasis to tympanic membrane and any middle ear
Ito, Takayuki; Johns, Alexis R.; Ostry, David J.
Purpose: Somatosensory information associated with speech articulatory movements affects the perception of speech sounds and vice versa, suggesting an intimate linkage between speech production and perception systems. However, it is unclear which cortical processes are involved in the interaction between speech sounds and orofacial somatosensory…
Lund, Are Valen
This thesis examines how we perceive an audiovisual narrative - here defined as film, television and video games - and seeks to establish a descriptive framework for auditory stimuli and their narrative functions in this regard. I initially adopt the viewpoint of cognitive psychology an account for basic information processing operations. I then discuss audiovisual perception in terms of the effects of sensory integration between the visual and auditory modalities on the construction of meani...
Munson, Benjamin; Schellinger, Sarah K.; Edwards, Jan
Previous research has shown that continuous rating scales can be used to assess phonetic detail in children’s productions, and could potentially be used to detect covert contrasts. Two experiments examined whether continuous rating scales have the additional benefit of being less susceptible to task-related biasing than categorical phonetic transcriptions. In both experiments, judgments of children’s productions of /s/ and /θ/ were interleaved with two types of rating tasks designed to induce bias: continuous judgments of a parameter whose variation is itself relatively more continuous (gender typicality of their speech) in one biasing condition, and categorical judgments of a parameter that is relatively less-continuous (the vowel they produced) in the other biasing condition. One experiment elicited continuous judgments of /s/ and /θ/ productions, while the other elicited categorical judgments. The results of Experiment 1 showed that the influence of acoustic characteristics on continuous judgments of /s/ and /θ/ was stable across biasing conditions. In contrast, the results of Experiment 2 showed that the influence of acoustic characteristics on categorical judgments of /s/ and /θ/ differed systematically across biasing conditions. These results suggest that continuous judgments are psychometrically superior to categorical judgments, as they are more resistant to task-related bias. PMID:27736242
Liem, Franziskus; Hurschler, Martina A; Jäncke, Lutz; Meyer, Martin
This study combines functional and structural magnetic resonance imaging to test the "asymmetric sampling in time" (AST) hypothesis, which makes assertions about the symmetrical and asymmetrical representation of speech in the primary and nonprimary auditory cortex. Twenty-three volunteers participated in this parametric clustered-sparse fMRI study. The availability of slowly changing acoustic cues in spoken sentences was systematically reduced over continuous segments with varying lengths (100, 150, 200, 250 ms) by utilizing local time-reversion. As predicted by the hypothesis, functional lateralization in Heschl's gyrus could not be observed. Lateralization in the planum temporale and posterior superior temporal gyrus shifted towards the right hemisphere with decreasing suprasegmental temporal integrity. Cortical thickness of the planum temporale was automatically measured. Participants with an L > R cortical thickness performed better on the in-scanner auditory pattern-matching task. Taken together, these findings support the AST hypothesis and provide substantial novel insight into the division of labor between left and right nonprimary auditory cortex functions during comprehension of spoken utterances. In addition, the present data yield support for a structural-behavioral relationship in the nonprimary auditory cortex.
言语感知遵循音不离词，词不离句的原则。除了语音特征、音位和单词三个感知单元外，句子单元也参与了言语感知的过程。在这一感知过程中，句子语境分别从句法和语义两方面对词汇的识别发生影响。在句法方面，句子层依据句法规则对词汇层产生自上而下的反馈作用，通过词类限制和曲折形态特征核查等方式实现对词汇层上备选单词的筛选；在语义方面，句子层根据语义限制条件对备选单词产生激活或抑制作用。%Phonemes, words and sentences are interconnected in speech perception. Besides phonetic features, phonemes and words, sentences are also engaged in speech perception. In speech perception, sentimental contexts exert influence on word recog-nition both syntactically and semantically. Syntactically, sentence levels exert top-down feedback effect on world levels according to syntactic rules, screening the candidates on word levels by constraining their part of speech or checking their inflectional fea-tures. Semantically, sentence levels activate pr inhibit the candidates by exerting semantic constraints.
Morrill, Tuuli; Baese-Berk, Melissa; Heffner, Christopher; Dilley, Laura
During lexical access, listeners use both signal-based and knowledge-based cues, and information from the linguistic context can affect the perception of acoustic speech information. Recent findings suggest that the various cues used in lexical access are implemented with flexibility and may be affected by information from the larger speech context. We conducted 2 experiments to examine effects of a signal-based cue (distal speech rate) and a knowledge-based cue (linguistic structure) on lexical perception. In Experiment 1, we manipulated distal speech rate in utterances where an acoustically ambiguous critical word was either obligatory for the utterance to be syntactically well formed (e.g., Conner knew that bread and butter (are) both in the pantry) or optional (e.g., Don must see the harbor (or) boats). In Experiment 2, we examined identical target utterances as in Experiment 1 but changed the distribution of linguistic structures in the fillers. The results of the 2 experiments demonstrate that speech rate and linguistic knowledge about critical word obligatoriness can both influence speech perception. In addition, it is possible to alter the strength of a signal-based cue by changing information in the speech environment. These results provide support for models of word segmentation that include flexible weighting of signal-based and knowledge-based cues.
WANG Bei; Caroline Féry
This study is an investigation of the prosodic encoding of split noun sentences in Chinese Putonghua, for instance, ＂shu, wo mai le san ben. （Book, I buy ASP three CLAS. ＇I bought three books＇）＂, in which syntactic fronting highlights the split noun. The question- and-answer paradigm was used to construct contexts where the split noun is either the topic or the focus of the sentence. Acoustic analysis of 280 split sentences read by seven speakers show that the maximum F0 of the base part is higher and the pause after the split noun is shorter in the topic condition than that in the focus condition. But the split noun itself does not differ in either F0 or duration across the two conditions. A perception experiment further shows that the difference in prosody between the two conditions is perceivable, since matched question-and-statement pairs are preferred over unmatched ones.
Kuhl, Patricia K; Ramírez, Rey R; Bosseler, Alexis; Lin, Jo-Fu Lotus; Imada, Toshiaki
Historic theories of speech perception (Motor Theory and Analysis by Synthesis) invoked listeners' knowledge of speech production to explain speech perception. Neuroimaging data show that adult listeners activate motor brain areas during speech perception. In two experiments using magnetoencephalography (MEG), we investigated motor brain activation, as well as auditory brain activation, during discrimination of native and nonnative syllables in infants at two ages that straddle the developmental transition from language-universal to language-specific speech perception. Adults are also tested in Exp. 1. MEG data revealed that 7-mo-old infants activate auditory (superior temporal) as well as motor brain areas (Broca's area, cerebellum) in response to speech, and equivalently for native and nonnative syllables. However, in 11- and 12-mo-old infants, native speech activates auditory brain areas to a greater degree than nonnative, whereas nonnative speech activates motor brain areas to a greater degree than native speech. This double dissociation in 11- to 12-mo-old infants matches the pattern of results obtained in adult listeners. Our infant data are consistent with Analysis by Synthesis: auditory analysis of speech is coupled with synthesis of the motor plans necessary to produce the speech signal. The findings have implications for: (i) perception-action theories of speech perception, (ii) the impact of "motherese" on early language learning, and (iii) the "social-gating" hypothesis and humans' development of social understanding.
Yang, Weiping; Yang, Jingjing; Gao, Yulin; Tang, Xiaoyu; Ren, Yanna; Takahashi, Satoshi; Wu, Jinglong
A combination of signals across modalities can facilitate sensory perception. The audiovisual facilitative effect strongly depends on the features of the stimulus. Here, we investigated how sound frequency, which is one of basic features of an auditory signal, modulates audiovisual integration. In this study, the task of the participant was to respond to a visual target stimulus by pressing a key while ignoring auditory stimuli, comprising of tones of different frequencies (0.5, 1, 2.5 and 5 kHz). A significant facilitation of reaction times was obtained following audiovisual stimulation, irrespective of whether the task-irrelevant sounds were low or high frequency. Using event-related potential (ERP), audiovisual integration was found over the occipital area for 0.5 kHz auditory stimuli from 190-210 ms, for 1 kHz stimuli from 170-200 ms, for 2.5 kHz stimuli from 140-200 ms, 5 kHz stimuli from 100-200 ms. These findings suggest that a higher frequency sound signal paired with visual stimuli might be early processed or integrated despite the auditory stimuli being task-irrelevant information. Furthermore, audiovisual integration in late latency (300-340 ms) ERPs with fronto-central topography was found for auditory stimuli of lower frequencies (0.5, 1 and 2.5 kHz). Our results confirmed that audiovisual integration is affected by the frequency of an auditory stimulus. Taken together, the neurophysiological results provide unique insight into how the brain processes a multisensory visual signal and auditory stimuli of different frequencies.
Schepers, Inga M; Yoshor, Daniel; Beauchamp, Michael S
Human speech contains both auditory and visual components, processed by their respective sensory cortices. We test a simple model in which task-relevant speech information is enhanced during cortical processing. Visual speech is most important when the auditory component is uninformative. Therefore, the model predicts that visual cortex responses should be enhanced to visual-only (V) speech compared with audiovisual (AV) speech. We recorded neuronal activity as patients perceived auditory-only (A), V, and AV speech. Visual cortex showed strong increases in high-gamma band power and strong decreases in alpha-band power to V and AV speech. Consistent with the model prediction, gamma-band increases and alpha-band decreases were stronger for V speech. The model predicts that the uninformative nature of the auditory component (not simply its absence) is the critical factor, a prediction we tested in a second experiment in which visual speech was paired with auditory white noise. As predicted, visual speech with auditory noise showed enhanced visual cortex responses relative to AV speech. An examination of the anatomical locus of the effects showed that all visual areas, including primary visual cortex, showed enhanced responses. Visual cortex responses to speech are enhanced under circumstances when visual information is most important for comprehension.
Full Text Available This research aimed to improve the skill in appreciating dances owned by the students of Primary Teacher Education of Makassar State University, to improve the perception towards audio-visual based art appreciation, to increase the students’ interest in audio-visual based art education subject, and to increase the students’ responses to the subject. This research was classroom action research using the research design created by Kemmis & MC. Taggart, which was conducted to 42 students of Primary Teacher Education of Makassar State University. The data collection was conducted using observation, questionnaire, and interview. The techniques of data analysis applied in this research were descriptive qualitative and quantitative. The results of this research were: (1 the students’ achievement in audio-visual based dance appreciation improved: precycle 33,33%, cycle I 42,85% and cycle II 83,33%, (2 the students’ perception towards the audio-visual based dance appreciation improved: cycle I 59,52%, and cycle II 71,42%. The students’ perception towards the subject obtained through structured interview in cycle I and II was 69,83% in a high category, (3 the interest of the students in the art education subject, especially audio-visual based dance appreciation, increased: cycle I 52,38% and cycle II 64,28%, and the students’ interest in the subject obtained through structured interview was 69,50 % in a high category. (3 the students’ response to audio-visual based dance appreciation increased: cycle I 54,76% and cycle II 69,04% in a good category.
Leung, Johahn; Wei, Vincent; Burgess, Martin; Carlile, Simon
The ability to actively follow a moving auditory target with our heads remains unexplored even though it is a common behavioral response. Previous studies of auditory motion perception have focused on the condition where the subjects are passive. The current study examined head tracking behavior to a moving auditory target along a horizontal 100° arc in the frontal hemisphere, with velocities ranging from 20 to 110°/s. By integrating high fidelity virtual auditory space with a high-speed visual presentation we compared tracking responses of auditory targets against visual-only and audio-visual "bisensory" stimuli. Three metrics were measured-onset, RMS, and gain error. The results showed that tracking accuracy (RMS error) varied linearly with target velocity, with a significantly higher rate in audition. Also, when the target moved faster than 80°/s, onset and RMS error were significantly worst in audition the other modalities while responses in the visual and bisensory conditions were statistically identical for all metrics measured. Lastly, audio-visual facilitation was not observed when tracking bisensory targets.
Full Text Available The ability to actively follow a moving auditory target with our heads remains unexplored even though it is a common behavioral response. Previous studies of auditory motion perception have focused on the condition where the subjects are passive. The current study examined head tracking behavior to a moving auditory target along a horizontal 100° arc in the frontal hemisphere, with velocities ranging from 20°/s to 110°/s. By integrating high fidelity virtual auditory space with a high-speed visual presentation we compared tracking responses of auditory targets against visual-only and audio-visual bisensory stimuli. Three metrics were measured – onset, RMS and gain error. The results showed that tracking accuracy (RMS error varied linearly with target velocity, with a significantly higher rate in audition. Also, when the target moved faster than 80°/s, onset and RMS error were significantly worst in audition the other modalities while responses in the visual and bisensory conditions were statistically identical for all metrics measured. Lastly, audio-visual facilitation was not observed when tracking bisensory targets.
Leone, Dorothy; Levy, Erika S.
Purpose: Much of a child's day is spent listening to speech in the presence of background noise. Although accurate vowel perception is important for listeners' accurate speech perception and comprehension, little is known about children's vowel perception in noise. "Clear speech" is a speech style frequently used by talkers in the…
Dov, David; Talmon, Ronen; Cohen, Israel
In this paper, we address the problem of multiple view data fusion in the presence of noise and interferences. Recent studies have approached this problem using kernel methods, by relying particularly on a product of kernels constructed separately for each view. From a graph theory point of view, we analyze this fusion approach in a discrete setting. More specifically, based on a statistical model for the connectivity between data points, we propose an algorithm for the selection of the kernel bandwidth, a parameter, which, as we show, has important implications on the robustness of this fusion approach to interferences. Then, we consider the fusion of audio-visual speech signals measured by a single microphone and by a video camera pointed to the face of the speaker. Specifically, we address the task of voice activity detection, i.e., the detection of speech and non-speech segments, in the presence of structured interferences such as keyboard taps and office noise. We propose an algorithm for voice activity detection based on the audio-visual signal. Simulation results show that the proposed algorithm outperforms competing fusion and voice activity detection approaches. In addition, we demonstrate that a proper selection of the kernel bandwidth indeed leads to improved performance.
Keitel, Christian; Müller, Matthias M
Our brain relies on neural mechanisms of selective attention and converging sensory processing to efficiently cope with rich and unceasing multisensory inputs. One prominent assumption holds that audio-visual synchrony can act as a strong attractor for spatial attention. Here, we tested for a similar effect of audio-visual synchrony on feature-selective attention. We presented two superimposed Gabor patches that differed in colour and orientation. On each trial, participants were cued to selectively attend to one of the two patches. Over time, spatial frequencies of both patches varied sinusoidally at distinct rates (3.14 and 3.63 Hz), giving rise to pulse-like percepts. A simultaneously presented pure tone carried a frequency modulation at the pulse rate of one of the two visual stimuli to introduce audio-visual synchrony. Pulsed stimulation elicited distinct time-locked oscillatory electrophysiological brain responses. These steady-state responses were quantified in the spectral domain to examine individual stimulus processing under conditions of synchronous versus asynchronous tone presentation and when respective stimuli were attended versus unattended. We found that both, attending to the colour of a stimulus and its synchrony with the tone, enhanced its processing. Moreover, both gain effects combined linearly for attended in-sync stimuli. Our results suggest that audio-visual synchrony can attract attention to specific stimulus features when stimuli overlap in space.
tone condition [F = (2, 39 = 47.18, p < .001, η2 = .55] and group [F = (2, 39 = 75.89, p = .017, η2 = .14], with T5 eliciting more positive responses than T2, and stronger responses from the [+Per+Pro] than [+Per-Pro] group. Correlations between production accuracy of the two rising tones and perceptual measures found that the averaged production accuracy was negatively correlated with the discrimination RT (r = -.502, p = .001, with shorter discrimination RTs associated with higher production accuracy. In addition, the production accuracy was positively correlated with the mean amplitude of brain responses to rise time of T5 (r = .421, p = .006, the larger the response, the higher the production accuracy. In summary, the present study demonstrated that tone perception is highly dynamic and exploits different acoustic cues at different stages of processing – rise time at the sensory/perceptual level and pitch feature at the cognitive level, as the auditory signal unfolds over time. Moreover, our findings revealed differential sensitivities between individuals with and without distinctive production of the two rising tones as evidenced by the differences in discrimination latency of the two tones and magnitude of brain response to short rise time. The individual differences found in production are proposed to have a perceptual origin, in that less defined phonological representations lead to less distinctive production.
The information systems and audio-visual documentation in the televisions are part of a great gear for the good operation of the audio-visual companies. In the present work are the main characteristics of the audio-visual documentation within the framework of the televising audio-visual organizations offering an express crossed on the aspects more excellent than the main users of these services must know. The article tries to demonstrate the importance and to show the possibilities that offer...
This master thesis analyzes the ethical challenges journalists have in their work, with special regard to code of conduct and hate speech. When it comes to the issue of hate speech, this master thesis focuses at hate speech directed to minorities in Turkey. The media market in Turkey is highly regulated by laws and regulations. As a result of that several newspapers have been in trouble with the law. This in turn leads to self-censorship in the business. Two media groups o...
... 29 Labor 1 2010-07-01 2010-07-01 true Audiovisual coverage permitted. 2.12 Section 2.12 Labor Office of the Secretary of Labor GENERAL REGULATIONS Audiovisual Coverage of Administrative Hearings § 2.12 Audiovisual coverage permitted. The following are the types of hearings where the...
This paper provides an attempted study of the speech perception,concentrating on acoustic phonetic aspects of the processes,which underline the capacity to identify the phonological structure of speech.
Camila Angélica Quintino
Full Text Available Aparelho de Amplificação Sonora Individual (AASI. OBJETIVO: Comparar o desempenho, benefício e a satisfação de usuários de AASI intra-aural e retroauricular digital com algoritmo de redução de ruído e microfones omnidirecional e direcional. MATERIAL E MÉTODO: 34 usuários de AASI digital foram avaliados por meio do reconhecimento de sentenças no ruído e dos questionários APHAB e IOI. Estudo prospectivo. RESULTADOS: Melhores resultados foram obtidos com AASI intra-aurais e AASI direcionais, no entanto, não houve diferença estatística significante entre os grupos. CONCLUSÃO: A direcionalidade favoreceu o reconhecimento de fala no ruído e o benefício obtido em vida diária.Hearing aid. AIM: To compare the performance, benefit and satisfaction of users of ITE, CIC and BTE digital hearing aid with noise reduction and omnidirectional and directional microphones. METHOD: 34 users of hearing aid were evaluated by means of speech perception in noise tests and APHAB and IOI self assessment questionnaires. Prospective study. RESULTS: Better results were obtained by users of ITE, CIC and directional hearing aids, however, no statistical significance was found between the groups. CONCLUSION: Directivity improved speech perception in noise and benefit in daily life situations.
Baskent, Deniz; Clarke, Jeanne; Pals, Carina; Benard, Michel R.; Bhargava, Pranesh; Saija, Jefta; Sarampalis, Anastasios; Wagner, Anita; Gaudrain, Etienne
External degradations in incoming speech reduce understanding, and hearing impairment further compounds the problem. While cognitive mechanisms alleviate some of the difficulties, their effectiveness may change with age. In our research, reviewed here, we investigated cognitive compensation with hea
Beatrice de Gelder
Full Text Available Multisensory integration must stand out among the fields of research that have witnessed a most impressive explosion of interest this last decade. One of these new areas of multisensory research concerns emotion. Since our first exploration of this phenomenon (de Gelder et al., 1999 a number of studies have appeared and they have used a wide variety of behavioral, neuropsychological and neuroscientifc methods. The goal of this presentation is threefold. First, we review the research on audiovisual perception of emotional signals from the face and the voice followed by a report or more recent studies on integrating emotional information provided by the voice and whole body expressions. We will also include some recent work on multisensory music perception. In the next section we discuss some methodological and theoretical issues. Finally, we will discuss findings about abnormal affective audiovisual integration in schizophrenia and in autism.
Raja Nadales, Daniel
Creació de tres Píndoles audiovisuals d'aproximadament 3 minuts de durada, compostes per una sèrie de consells relacionats amb la salut, la cura de pacients i el seu entorn, creant una funció d'utilitat a l'usuari. Les píndoles estan complementades per un llenguatge de fàcil comprensió i enteniment i estan subjectes a una lliure accessibilitat mitjançant la distribució per Internet, adaptades a qualsevol aparell electrònic de reproducció audiovisual.
王硕; 董瑞娟; Solveig Christina Voss; 钱金宇; 吴燕君; 张华
目的：本研究对感音神经性听力损失患者助听器选配后的言语识别能力进行评价，并分析听力损失程度与年龄对助听后言语康复效果的影响。方法30名感音神经性听力损失受试者，男13名，女17名，年龄26-86岁，双侧听力损失程度对称，双耳0.5-4 kHz频率下纯音听力阈值（PTA0.5-4 kHz）平均值40～75 dB HL。所有受试者均选配Phonak Bolero Q50系列耳背式助听器。使用汉语普通话言语测试软件（Mandarin Speech Test Materials, MSTMs）进行裸耳和助听后安静与噪声环境下言语识别能力测试。结果（1）助听后，安静环境下的双音节识别率平均提高35.1±19.5%；噪声环境下语句识别率平均提高32.8±22.8%；（2）助听后言语识别能力与听力损失程度呈显著负相关关系；（3）助听优势高于平均水平的受试者纯音听阈均大于50 dB HL，但存在个体差异大的特点。结论助听器选配可以有效帮助感音神经性听力损失患者提高言语识别能力，但听力损失程度不是唯一影响助听效果的因素，助听后言语识别能力的改善存在较大个体差异。%Objective This study was aimed at evaluating the speech perception performance in sensorineural hear-ing-impaired listeners with hearing aids. Methods Thirty subjects with sensorineural hearing loss were recruited, including 13 males and 17 females with the age ranging from 26 to 86 years. They had bilaterally symmetric hearing loss with the av-eraged 0.5-4 kHz PTA ranging from 40 to 75 dB HL. They were fitted with Phonak Bolero Q50 BTE hearing aids unilaterally. The Mandarin Speech Test Materials (MSTMs) software was used to test speech perception performance under four condi-tions, including unaided quiet, aided quiet, unaided noisy and aided noisy environments. Results (1) After fitting hearing aids, the speech perception score in quiet using bisyllabic materials improved by 35.1±19.5%in average
Blanca Rodríguez Bravo
Full Text Available Se analizan las peculiaridades del documento audiovisual y el tratamiento documental que sufre en las emisoras de televisión. Observando a las particularidades de la imagen que condicionan su análisis y recuperación, se establecen las etapas y procedimientos para representar el mensaje audiovisual con vistas a su reutilización. Por último se realizan algunas consideraciones acerca del procesamiento automático del video y de los cambios introducidos por la televisión digital.Peculiarities of the audio-visual document and the treatment it undergoes in TV broadcasting stations are analyzed. The particular features of images condition their analysis and recovery; this paper establishes stages and proceedings for the representation of audio-visual messages with a view to their re-usability Also, some considerations about the automatic processing of the video and the changes introduced by digital TV are made.
Cogan, Gregory B; Thesen, Thomas; Carlson, Chad; Doyle, Werner; Devinsky, Orrin; Pesaran, Bijan
Historically, the study of speech processing has emphasized a strong link between auditory perceptual input and motor production output. A kind of 'parity' is essential, as both perception- and production-based representations must form a unified interface to facilitate access to higher-order language processes such as syntax and semantics, believed to be computed in the dominant, typically left hemisphere. Although various theories have been proposed to unite perception and production, the underlying neural mechanisms are unclear. Early models of speech and language processing proposed that perceptual processing occurred in the left posterior superior temporal gyrus (Wernicke's area) and motor production processes occurred in the left inferior frontal gyrus (Broca's area). Sensory activity was proposed to link to production activity through connecting fibre tracts, forming the left lateralized speech sensory-motor system. Although recent evidence indicates that speech perception occurs bilaterally, prevailing models maintain that the speech sensory-motor system is left lateralized and facilitates the transformation from sensory-based auditory representations to motor-based production representations. However, evidence for the lateralized computation of sensory-motor speech transformations is indirect and primarily comes from stroke patients that have speech repetition deficits (conduction aphasia) and studies using covert speech and haemodynamic functional imaging. Whether the speech sensory-motor system is lateralized, like higher-order language processes, or bilateral, like speech perception, is controversial. Here we use direct neural recordings in subjects performing sensory-motor tasks involving overt speech production to show that sensory-motor transformations occur bilaterally. We demonstrate that electrodes over bilateral inferior frontal, inferior parietal, superior temporal, premotor and somatosensory cortices exhibit robust sensory-motor neural
Oh, Anna; Duerden, Emma G; Pang, Elizabeth W
Lesion and neuroimaging studies indicate that the insula mediates motor aspects of speech production, specifically, articulatory control. Although it has direct connections to Broca's area, the canonical speech production region, the insula is also broadly connected with other speech and language centres, and may play a role in coordinating higher-order cognitive aspects of speech and language production. The extent of the insula's involvement in speech and language processing was assessed using the Activation Likelihood Estimation (ALE) method. Meta-analyses of 42 fMRI studies with healthy adults were performed, comparing insula activation during performance of language (expressive and receptive) and speech (production and perception) tasks. Both tasks activated bilateral anterior insulae. However, speech perception tasks preferentially activated the left dorsal mid-insula, whereas expressive language tasks activated left ventral mid-insula. Results suggest distinct regions of the mid-insula play different roles in speech and language processing.
Freeman, Elliot D; Ipser, Alberta; Palmbaha, Austra; Paunoiu, Diana; Brown, Peter; Lambert, Christian; Leff, Alex; Driver, Jon
The sight and sound of a person speaking or a ball bouncing may seem simultaneous, but their corresponding neural signals are spread out over time as they arrive at different multisensory brain sites. How subjective timing relates to such neural timing remains a fundamental neuroscientific and philosophical puzzle. A dominant assumption is that temporal coherence is achieved by sensory resynchronisation or recalibration across asynchronous brain events. This assumption is easily confirmed by estimating subjective audiovisual timing for groups of subjects, which is on average similar across different measures and stimuli, and approximately veridical. But few studies have examined normal and pathological individual differences in such measures. Case PH, with lesions in pons and basal ganglia, hears people speak before seeing their lips move. Temporal order judgements (TOJs) confirmed this: voices had to lag lip-movements (by ∼200 msec) to seem synchronous to PH. Curiously, voices had to lead lips (also by ∼200 msec) to maximise the McGurk illusion (a measure of audiovisual speech integration). On average across these measures, PH's timing was therefore still veridical. Age-matched control participants showed similar discrepancies. Indeed, normal individual differences in TOJ and McGurk timing correlated negatively: subjects needing an auditory lag for subjective simultaneity needed an auditory lead for maximal McGurk, and vice versa. This generalised to the Stream-Bounce illusion. Such surprising antagonism seems opposed to good sensory resynchronisation, yet average timing across tasks was still near-veridical. Our findings reveal remarkable disunity of audiovisual timing within and between subjects. To explain this we propose that the timing of audiovisual signals within different brain mechanisms is perceived relative to the average timing across mechanisms. Such renormalisation fully explains the curious antagonistic relationship between disparate timing
Zla ba sgrol ma
This file contains a plowing speech and a discussion about the speech This collection presents forty-nine audio files including: several folk song genres; folktales and; local history from the Sman shad Valley of Sde dge county World Oral Literature Project
Ordelman, R.J.F.; Jong, de F.M.G.; Leeuwen, van D.A.; Blanken, H.M.; de Vries, A.P.; Blok, H.E.; Feng, L.
This chapter will focus on the automatic extraction of information from the speech in multimedia documents. This approach is often referred to as speech indexing and it can be regarded as a subfield of audio indexing that also incorporates for example the analysis of music and sounds. If the objecti
Ravishankar, C., Hughes Network Systems, Germantown, MD
Speech is the predominant means of communication between human beings and since the invention of the telephone by Alexander Graham Bell in 1876, speech services have remained to be the core service in almost all telecommunication systems. Original analog methods of telephony had the disadvantage of speech signal getting corrupted by noise, cross-talk and distortion Long haul transmissions which use repeaters to compensate for the loss in signal strength on transmission links also increase the associated noise and distortion. On the other hand digital transmission is relatively immune to noise, cross-talk and distortion primarily because of the capability to faithfully regenerate digital signal at each repeater purely based on a binary decision. Hence end-to-end performance of the digital link essentially becomes independent of the length and operating frequency bands of the link Hence from a transmission point of view digital transmission has been the preferred approach due to its higher immunity to noise. The need to carry digital speech became extremely important from a service provision point of view as well. Modem requirements have introduced the need for robust, flexible and secure services that can carry a multitude of signal types (such as voice, data and video) without a fundamental change in infrastructure. Such a requirement could not have been easily met without the advent of digital transmission systems, thereby requiring speech to be coded digitally. The term Speech Coding is often referred to techniques that represent or code speech signals either directly as a waveform or as a set of parameters by analyzing the speech signal. In either case, the codes are transmitted to the distant end where speech is reconstructed or synthesized using the received set of codes. A more generic term that is applicable to these techniques that is often interchangeably used with speech coding is the term voice coding. This term is more generic in the sense that the
Ana Tereza de Matos Magalhães
Full Text Available As novas tecnologias do processador Freedom® foram criadas para proporcionar melhorias no processamento do som acústico de entrada, não apenas para novos usuários, como para gerações anteriores de implante coclear. OBJETIVO: Identificar a contribuição da tecnologia do processador de fala Freedom® para implante coclear multicanal, Nucleus22®, no desempenho de percepção de fala no silêncio e no ruído, e nos limiares audiométricos. MATERIAL E MÉTODO: A forma de estudo foi de coorte histórico com corte transversal. Dezessete pacientes preencheram os critérios de inclusão. Antes de iniciar os testes, o último mapa em uso com o Spectra® foi revisto e otimizado e o funcionamento do processador foi verificado. Os testes de fala foram apresentados a 60dBNPS em material gravado: monossílabos; frases em apresentação aberta no silêncio; e no ruído (SNR = 0dB. Foram realizadas audiometrias em campo livre com ambos os processadores de fala. A análise estatística utilizou testes não-paramétricos. RESULTADOS: Quando analisada a contribuição do Freedom® para pacientes com Nucleus22®, observa-se diferença estatisticamente significativa em todos os testes de percepção de fala e em todos os limiares audiométricos. CONCLUSÃO: A tecnologia contribuiu no desempenho de percepção de fala e nos limiares audiométricos dos pacientes usuários de Nucleus22®.New technology in the Freedom® speech processor for cochlear implants was developed to improve how incoming acoustic sound is processed; this applies not only for new users, but also for previous generations of cochlear implants. AIM: To identify the contribution of this technology - the Nucleus 22® - on speech perception tests in silence and in noise, and on audiometric thresholds. METHODS: A cross-sectional cohort study was undertaken. Seventeen patients were selected. The last map based on the Spectra® was revised and optimized before starting the tests. Troubleshooting
Describes results of survey of media service directors at public universities in Ohio to determine the expected longevity of audiovisual equipment. Use of the Delphi technique for estimates is explained, results are compared with an earlier survey done in 1977, and use of spreadsheet software to calculate depreciation is discussed. (LRW)