Functional imaging during speech production
Introduction
The production of spoken words is the end product of a complex network of linguistic and cognitive processes. Thoughts and intentions are remarkably transformed into a sequence of movements and sounds on the order of hundreds of milliseconds. One of the great achievements of psycholinguistic research in the past 30 years has been the preliminary system identification of this language output system. Distinct phases of planning and control in this process have been identified through the study of errors (e.g., Fromkin, 1973) and studies of the timing of production (Levelt, 1989), and formal models of language production have been proposed to account for these data.
Levelt's model is representative of these frameworks and it is summarized in Fig. 1. As can be seen in this figure, the progression from concept to articulation is thought to involve separate processes that impart syntactic, morphological, phonological and phonetic organization. At the end of this flow of information processing lies the final and often forgotten stage of the language sequence, articulation itself. Most psycholinguistic models of this kind give minimal attention to the final stage of actual speech production. Like a last-minute addition to a guest list, articulation does not feature in early phases of planning. In fact, in many models of language production, articulation is portrayed as a passive and modular final filter for language production. For a variety of technical reasons, this de-emphasis of articulation has been mirrored in functional imaging studies of language production. Many neuroimaging studies have used a range of "silent" tasks in which words are not even spoken. This is unfortunate because it de-emphasizes the medium in which the evolution of language presumably took place and it makes an unwarranted "pure insertion" assumption about speech motor control (Jennings, McIntosh, Kapur, Tulving, & Houle, 1997).
There are many reasons for giving the articulation component of language production more attention in functional imaging studies. Foremost amongst these is the fact that the speech motor system is not a passive channel for the transmission of linguistic signals from prior planning stages. Rather, it transforms those signals in a number of ways. At the most peripheral level, the nonlinear mechanics of tissue and muscle, the inertial forces of the moving articulators and the complexities of force generation in muscles each contribute to the form of articulation and thus to the final form of the speech output. In order to accommodate this degree of nonlinearity, the nervous system is thought to have internal models of the vocal tract and the acoustic consequences of articulation (Guenther, 1994; Hirayama, Vatikiotis-Bateson, & Kawato, 1994; Jones & Munhall, 2000; Jordan, 1990, 1996; Kawato, Furukawa, & Suzuki, 1987). An internal model is a representation of the forces and kinematics of movements as well as the feedback that results from those movements (see Miall & Wolpert, 1996; Kawato, 1999, for discussions of internal models in movement control). In this depiction of the act of speaking, articulation involves significant information processing in order to manage the motor system in the vocal tract during phonetic sequencing. Timing, force, and trajectory control of a large number of articulators have to be programmed in rapid succession. This computation engages a neurally distributed planning and control system.
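The notion of a forward internal model can be made concrete with a toy simulation. The following is a minimal sketch, assuming a hypothetical one-degree-of-freedom mass-spring "articulator" and a linear predictor; neither the plant nor its parameters are taken from the models cited above.

```python
# Toy illustration of a forward internal model: a damped mass-spring
# "articulator" (illustrative assumption, not from the speech literature)
# is driven by motor commands, and a model is fit that predicts the
# sensory consequence (next position) of a command before feedback
# could arrive.
import random

def plant_step(pos, vel, cmd, dt=0.005, k=80.0, b=8.0):
    """One semi-implicit Euler step of a mass-spring-damper articulator."""
    vel = vel + (cmd - k * pos - b * vel) * dt
    pos = pos + vel * dt
    return pos, vel

# Gather (state, command) -> next-position pairs from random commands.
random.seed(0)
samples = []
pos, vel = 0.0, 0.0
for _ in range(2000):
    cmd = random.uniform(-1.0, 1.0)
    new_pos, new_vel = plant_step(pos, vel, cmd)
    samples.append(((pos, vel, cmd), new_pos))
    pos, vel = new_pos, new_vel

# Fit a linear forward model next_pos ~ w . (pos, vel, cmd) by least
# squares (normal equations solved by Gauss-Jordan elimination; the
# normal matrix is symmetric positive definite, so no pivoting needed).
A = [[0.0] * 3 for _ in range(3)]
rhs = [0.0] * 3
for x, target in samples:
    for i in range(3):
        rhs[i] += x[i] * target
        for j in range(3):
            A[i][j] += x[i] * x[j]
for i in range(3):
    piv = A[i][i]
    A[i] = [a / piv for a in A[i]]
    rhs[i] /= piv
    for r in range(3):
        if r != i:
            f = A[r][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            rhs[r] -= f * rhs[i]
w = rhs

# The learned model anticipates the outcome of a command from the
# current state, without waiting for sensory feedback.
state = (0.1, 0.0, 0.5)
predicted = sum(wi * xi for wi, xi in zip(w, state))
actual, _ = plant_step(*state)
print(abs(predicted - actual) < 1e-6)
```

Because this toy plant happens to be linear, a linear predictor suffices; the internal models proposed for the vocal tract must instead cope with the nonlinear mechanics described above, typically with learned nonlinear function approximators.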
The amount of information that must be conveyed and therefore programmed by the motor system is remarkable. Talkers modify their speech style and precision in real time to fit social and environmental context and to respond to conversational demands. Mood, intent, attention, as well as the conceptual and emotional meaning of a message are all transmitted in parallel with phonetic information by subtle differences in the way in which words are spoken. These subtle differences are produced by systematic modifications to the movement timing and trajectories of the oral articulators. For example, talkers can voluntarily modify the clarity with which they speak (Gagne, Masterson, Munhall, Bilida, & Querengesser, 1995) and speech is perceptually distinct if it is read, recited, voluntarily produced, attentionally engaged, etc. (e.g., Remez, Lipton, & Fellowes, 1996). In addition, there is variability in articulation from a number of other sources. Speaking rate, lexical and emphatic stress and syllabification can vary from repetition to repetition.
How this motor planning is accomplished is not well understood but an interaction between levels of planning is indicated. For example, Saltzman, Lofqvist, and Mitra (2000) have used a phase resetting paradigm to study the relationship between central "clock" mechanisms and the timing of peripheral motor events in speech. When randomly timed mechanical perturbations are applied to the lower lip during repetitive syllable production, the patterns of timing adjustments are consistent with a model in which central timing is modulated by the peripheral motor system (Gracco & Abbs, 1989). Kawato (1997) has suggested that this arrangement is preferred for computational reasons as well. During trajectory planning, the solution space can be reduced more effectively if constraints (e.g., smoothness) are specified at other planning levels. To do this, however, the trajectory planning (along with other information processing problems in motor control such as coordinate transformations and motor command generation) must be computed simultaneously rather than sequentially.
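The smoothness constraint invoked here is commonly formalized as the minimum-jerk criterion of Flash and Hogan (1985): among all trajectories between two positions in a fixed time, choose the one that minimizes integrated squared jerk. A minimal sketch with illustrative numbers (the gesture amplitude and duration are hypothetical, not articulator data):

```python
# Minimum-jerk point-to-point trajectory; the 10-15-6 polynomial is
# the closed-form solution of the minimum-jerk criterion.

def min_jerk(x0, xf, T, t):
    """Minimum-jerk position at time t for a movement from x0 to xf in T s."""
    tau = t / T
    return x0 + (xf - x0) * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

# A hypothetical 150 ms, 10 mm opening gesture sampled at 1 kHz.
T, n = 0.150, 151
dt = T / (n - 1)
traj = [min_jerk(0.0, 10.0, T, i * dt) for i in range(n)]

# Central-difference velocity: the profile is single-peaked and
# bell-shaped, with zero velocity at both endpoints, qualitatively
# like many recorded articulator movements.
vel = [(traj[i + 1] - traj[i - 1]) / (2 * dt) for i in range(1, n - 1)]
peak = max(vel)  # analytic peak speed is 1.875 * amplitude / T = 125 mm/s
print(round(traj[-1], 3), round(peak, 1))
```

Specifying smoothness at this level collapses the infinite set of trajectories that reach a phonetic target into a single candidate, which is the sense in which such constraints reduce the solution space for planning.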
Understanding articulation, then, is not simply a matter of understanding how the jaw, for example, moves for a canonical phoneme target. Rather, such an understanding must include a description of coordination in the context of the range of performance conditions occurring during communication. Trajectory formation for the articulators would thus involve interaction with many levels of the language system as well as involving complex motor planning in order to achieve these goals in the presence of the biomechanical and physiological complexity of the vocal tract (Munhall, Kawato, & Vatikiotis-Bateson, 2000).
The nonadditivity of cognitive processing and type of response has recently been demonstrated in a semantic judgment task (Jennings et al., 1997). PET activation patterns varied for the same semantic task depending on the response mode (mouse click, overt speech, silent thought). Jennings et al. suggest that the way in which subjects organized their responses induced activation in different areas of the brain for the semantic processing. This kind of interaction is consistent with a system with many reciprocal connections and feedback projections such as the language system.
Factors such as these argue strongly for careful study of speech production in functional imaging studies. From a practical point of view, the manner in which words are spoken defines part of the “task” in any language production study and the presence of real articulation permits more precise task monitoring in imaging studies. Response timing, accuracy and task compliance can only be measured when there actually is a response! There is good evidence that this should be a concern in language production studies. Untrained talkers are not very precise at controlling many speech variables in laboratory studies of language output. For example, speaking rate, in spite of its popularity as a manipulation in speech production studies, is notoriously variable (Miller & Baer, 1983). Rate of articulation has also been reported to influence regional blood flow in PET experiments (e.g., Paus, Perry, Zatorre, Worsley, & Evans, 1996).
Indefrey and Levelt (2000) have recently argued that a range of poorly understood “lead-in” processes used in neuroimaging studies of word production (e.g., picture naming, generating words from beginning letter, word reading, repetition) have to be more carefully considered since they influence neuroimaging results. In a similar vein, details of articulation define a range of “lead-out” processes that will influence imaging data. Variability in speech obviously can result from different neural control strategies at many levels of the production system and this will be reflected in different neural activation patterns.
In addition to these methodological reasons for studying the functional representation of overt speech, there are theoretical reasons as well. Speech can be viewed as a complex linguistic process (Liberman & Whalen, 2000), and thus the neural representation of speech gestures is seen as an inextricable part of the language system. In Liberman's view, the phonetic gestures are the "common currency" of two-way communication and thus are the primitives of human language.
More pragmatically, there is a set of classic problems in the understanding of articulation that can profitably be addressed using functional imaging. These include (a) specifying the fundamental units of speech coordination and understanding how these units are implemented across the range of contexts that influence precision and timing, (b) understanding how central representations of these units interact with feedback in real time, and (c) understanding how these units are acquired and modified by learning. For each of these problems imaging could aid in both the neural mapping of the behavior and potentially the specification of the processing components of the behavior (Kosslyn, 1999). In the final section of the paper I will return to these problems.
To date, functional imaging has played two distinct roles in the study of speaking. First, a form of functional structural imaging has long been important for the precise description of the actual behavior of talking. Second, functional neural imaging has recently contributed to our understanding of the cognitive and linguistic processes responsible for articulation and for mapping these putative processes onto the neural architecture. In this paper I will give an overview of these two roles played by imaging, particularly magnetic resonance imaging (MRI), and comment on how functional imaging can help resolve enduring problems in the field. I will begin by commenting on the state of functional structural and functional neural imaging using MR.
Structural imaging of speaking
Sound production in human speech involves the generation of sound (e.g., by the vocal folds) and its filtering by the acoustic properties of the vocal tract (i.e., shape, size, wall characteristics, etc.). This division of explanation into source and filter has been the working model in speech research for the last 50 years (Fant, 1950) but speech bioacoustics and speech motor control are still far from completely understood. In part, this is due to the relative inaccessibility of the speech
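The source-filter division described here can be sketched computationally. The following is a minimal illustration, assuming an idealized impulse-train glottal source and textbook-style formant values for an /a/-like vowel; none of the numbers are measurements from this paper.

```python
# Source-filter sketch: a flat-spectrum pulse train (source) is passed
# through a cascade of two-pole digital resonators (filter), one per
# formant. Formant frequencies and bandwidths are illustrative.
import cmath
import math

fs = 16000                 # sampling rate (Hz)
period = 128               # one pulse every 128 samples -> f0 = 125 Hz
n = 63 * period            # an integer number of pitch periods (~0.5 s)

# Source: unit impulse train, a crude stand-in for glottal pulses.
source = [1.0 if i % period == 0 else 0.0 for i in range(n)]

def resonator(x, f, bw):
    """Two-pole resonator: one formant at centre f (Hz), bandwidth bw (Hz)."""
    r = math.exp(-math.pi * bw / fs)
    c1 = 2.0 * r * math.cos(2.0 * math.pi * f / fs)
    c2 = -r * r
    out, y1, y2 = [], 0.0, 0.0
    for s in x:
        y = s + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Filter: cascade of resonators for the first three formants.
signal = source
for f, bw in [(700, 80), (1100, 90), (2600, 120)]:
    signal = resonator(signal, f, bw)

def mag_at(x, f):
    """DFT magnitude of x at frequency f (Hz)."""
    w = cmath.exp(-2j * math.pi * f / fs)
    acc, ph = 0j, 1 + 0j
    for s in x:
        acc += s * ph
        ph *= w
    return abs(acc)

# Harmonics near a formant carry far more energy than harmonics above
# all the formants: the filter, not the source, shapes the spectrum.
near_f1 = mag_at(signal, 750)    # 6th harmonic, close to F1
above = mag_at(signal, 3500)     # 28th harmonic, above every formant
print(near_f1 > 10 * above)
```

Changing the resonator centre frequencies (the vocal tract shape) changes the vowel while the source is untouched, which is the independence of source and filter that the working model assumes.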
Functional neural imaging of speaking
It has been known for over a century that multiple regions of the brain are involved in producing spoken language. Our knowledge of these neuroanatomical correlates of speech production primarily comes from deficit/lesion studies (see Caplan, 1992) and electrical stimulation studies (e.g., Ojemann, 1991; Penfield & Roberts, 1959) but recently, a number of imaging techniques have also been used. (See Savoy, 2001, for an extensive historical review.)
The key components of the language production
Three problems in speech motor control
As indicated above, recording the neural correlates of articulation is a methodologically difficult task. However, the full problem involves more challenges than just the complexity of the data acquisition. As Indefrey and Levelt (2000) suggest, some kind of task decomposition is required before functional imaging of language production can succeed. We need to specify the necessary and sufficient components of speaking aloud. In doing so, we not only must identify the key neural subsystems of
Conclusions
After more than a decade of functional neural imaging of language production, we have seen significant advances in our understanding of language production as well as the methodologies for studying it. A variety of new imaging and neural recording techniques are available to researchers today (e.g., fMRI, PET, MEG, ERP, etc.), each with its own strengths and weaknesses. The study of articulation, the end product of language production, is particularly challenging and may require the use of
Acknowledgments
This work was funded by NIH grant DC-00594 from the National Institute on Deafness and Other Communication Disorders and NSERC. The author wishes to thank Jeff Jones for helpful comments on an earlier draft of this manuscript. Brad Story (Fig. 2) and Mark Tiede (Fig. 3) generously shared their vocal tract images.
References (120)
- The contribution of the cerebellum to speech processing. Journal of Neurolinguistics (2000).
- Overt verbal responding during fMRI scanning: empirical investigations of problems and potential solutions. NeuroImage (1999).
- PET studies of phonological processing: a critical reply to Poeppel. Brain and Language (1996).
- Susceptibility-induced loss of signal: comparing PET and fMRI on a semantic task. NeuroImage (2000).
- Linguistic processing. International Review of Neurobiology (1997).
- Cognitive subtractions may not add up: the interaction between semantic processing and response mode. NeuroImage (1997).
- Internal models for motor control and trajectory planning. Current Opinion in Neurobiology (1999).
- Articulatory timing in selected consonant sequences. Brain and Language (1975).
- On the relation of speech to language. Trends in Cognitive Science (2000).
- Forward models for physiological motor control. Neural Networks (1996).
- Inhibition of auditory cortical neurons during phonation. Brain Research.
- Subcortical aphasias. Brain and Language.
- Differential effects of overt, covert and replayed speech on vowel-evoked responses of the human auditory cortex. Neuroscience Letters.
- Subject's own speech reduces reactivity of the human auditory cortex. Neuroscience Letters.
- Can neuroimaging really tell us what the brain is doing? The relevance of indirect measures of population activity. Acta Psychologica.
- Integrating cognitive psychology, neurology, and neuroimaging. Acta Psychologica.
- Speech motor control: acoustic goals, saturation effects, auditory feedback and internal models. Speech Communication.
- Critical review of PET studies of phonological processing. Brain and Language.
- Imaging brain plasticity: conceptual and methodological issues – a theoretical review. NeuroImage.
- The effect of varying stimulus rate and duration on brain activity during reading. NeuroImage.
- Control of complex motor gestures: orofacial muscle responses to load perturbations of the lip during speech. Journal of Neurophysiology.
- Analysis of vocal tract shape and dimensions using magnetic resonance imaging: vowels. Journal of the Acoustical Society of America.
- The neural bases of prosody: insights from lesion studies and neuroimaging. Aphasiology.
- Magnetic field changes in the human brain due to swallowing or speaking. Magnetic Resonance in Medicine.
- Event-related fMRI of tasks involving brief motion. Human Brain Mapping.
- Voice F0 responses to manipulations in pitch feedback. Journal of the Acoustical Society of America.
- An auditory-feedback based neural network model of speech production that is robust to developmental changes in the size and shape of articulatory systems. Journal of Speech, Language & Hearing Research.
- Language.
- Cortical and subcortical control of tongue movement in humans: a functional neuroimaging study using fMRI. Journal of Applied Physiology.
- Neuronal activity in the human lateral temporal lobe. II. Responses to the subject's own voice. Experimental Brain Research.
- Aphasia with lesions in the basal ganglia and internal capsule. Archives of Neurology.
- Morphological and acoustical analysis of the nasal and the paranasal cavities. Journal of the Acoustical Society of America.
- A new brain region for co-ordinating speech articulation. Nature.
- Acoustic theory of speech production.
- Phonology, semantics, and the role of the left inferior prefrontal cortex. Human Brain Mapping.
- Lip and jaw motor control during speech: responses to resistive loading of the jaw. Journal of Speech & Hearing Research.
- Brain correlates of stuttering and syllable production. Brain.
- Speech errors as linguistic evidence.
- Across talker variability in speech intelligibility for conversational and clear speech. Journal of the Academy of Rehabilitative Audiology.
- Dynamic control of the perioral system during speech: kinematic analyses of autogenic and nonautogenic sensorimotor processes. Journal of Neurophysiology.
- Central patterning of speech movements. Experimental Brain Research.
- Sensorimotor characteristics of speech motor sequences. Experimental Brain Research.
- A neural network model of speech acquisition and motor equivalent production. Biological Cybernetics.
- Cortical speech processing mechanisms while vocalizing visually presented languages. NeuroReport.