High-dimensional semantic space accounts of priming☆
Introduction
A common finding in the psycholinguistic literature is that a word is processed more efficiently when it is preceded by processing of a related word. The common assumption is that the first word (the prime) facilitates processing of the second word (the target) because it contains within it some of the mental code required for the second response (Rosch, 1975). In semantic priming, the magnitude of facilitation depends on the semantic similarity between the prime and target. For example, nurse is processed more efficiently when preceded by doctor than when preceded by bread (Meyer & Schvaneveldt, 1971). For this reason, priming has been the predominant task used to study the structure of semantic memory (more specifically, representation of word meaning).
It remains the topic of considerable debate whether semantic priming effects are a result of semantic overlap per se, or are simply due to learned association strength between primes and targets (for reviews, see Hutchison, 2003, Lucas, 2000, McNamara, 2005, Neely, 1991). The debate has important consequences for the opposing localist and distributed approaches to representing word meaning.
Localist models (e.g., semantic networks; Collins & Quillian, 1972) assume that words are represented by nodes of interconnected concepts. Words that are connected to one another by more (or shorter) pathways are more similar in meaning. Localist models account for semantic priming by applying the construct of spreading activation (Collins & Loftus, 1975). When nodes in a network are activated, the activation spreads along the associated pathways to related nodes. The spread of activation makes the connected nodes already partially activated when a related concept is processed. Although spreading activation is an important explanatory concept in semantic networks (Balota & Lorch, 1986), it is important to note that it is a process construct that operates on the structural representation in a semantic network. In any model, priming requires both an account of the process as well as an account of the structure upon which the process operates.
By contrast, distributed models assume that word meaning is a pattern of elements in an array; the elements may be individually interpretable (e.g., feature lists) or only meaningful as part of an aggregate abstract pattern (e.g., connectionist representations). In a feature list theory (Smith, Shoben, & Rips, 1974), words are represented by lists of binary descriptive features. For example, birds have wings and dogs do not. Semantic priming is accounted for in feature lists simply by overlapping features between the prime and target. Whereas robin shares no features with chair, it has more shared features with bat, and even more with sparrow. In a connectionist representation, a word’s meaning is distributed over an aggregate pattern of element weights, but none of the elements has interpretable meaning on its own.
A major problem with both feature list and semantic network theories is that the models do not actually learn anything—the semantic representations must be built into the model by the theorist himself. Hand-coded representations rely on intuition of semantic similarity and dimensionality (either by the theorist, or subjective norms, e.g., McRae, de Sa, & Seidenberg, 1997), and may be an inaccurate representation of the information that is truly salient for semantic representation. Hummel and Holyoak (2003) have noted that hand-coded representations are a serious problem if cognitive modeling is to be a truly scientific enterprise: “All models are sensitive to their representation, so the choice of representation is among the most powerful wildcards at the modeler’s disposal” (p. 247).
In addition, hand coding representations artificially hardwires complexity into a model. Assuming that the complexity required for semantic representation is available in the environment, it is more appealing for a model to use simple mechanisms to learn its representations from statistical redundancies in the environment, rather than the theorist building complexity into the model based on intuition. The notion of automatically learning representations from environmental redundancies is the motivation behind recent co-occurrence models (e.g., Landauer and Dumais, 1997, Lund and Burgess, 1996).
Co-occurrence models attempt to build semantic representations for words directly from statistical co-occurrences in text. Typically, words are represented in a high-dimensional semantic space (cf. Osgood, 1952, Osgood, 1971, Salton, 1973, Salton et al., 1975). For this reason, such models are often referred to as “semantic space” models. Co-occurrence models capitalize on the frequency of words in contexts across a large sample of text. The co-occurrence approach minimizes representation and processing assumptions because much of the model’s complexity is learned from the environment—it is not hardwired into the model. For example, to know what hammer means, the model will observe all the contexts in which hammer is used. One may infer that hammer is related to other frequent words in those contexts, such as nail and board. Further, one may induce that hammer is similar to words that appear in similar contexts (i.e., with the same words), such as mallet or screwdriver. By the same logic, hammer is likely less similar to chromosome because they tend not to appear in the same or similar contexts.
In Latent Semantic Analysis (LSA; Deerwester et al., 1990, Landauer and Dumais, 1997), a large-scale text corpus is first transformed into a sparse word-by-document frequency matrix, typically using about 90,000 words and about 40,000 documents. The entries are then converted to log-frequency values and divided by the word’s entropy, −Σp log p, computed over all its documents. Next, the dimensionality of the word-by-document matrix is reduced using singular value decomposition (SVD) so that each word is represented by a dense vector of approximately 300 dimensions; the dimensions themselves, however, have no particular meaning or direct correspondence to the text. SVD has the effect of bringing out latent semantic relationships between words, even if they have never co-occurred in the same document. The basic premise in LSA is that the aggregate contexts in which a word does and does not appear provide a set of mutual constraints to induce the word’s meaning (Landauer, Foltz, & Laham, 1998).
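The pipeline just described can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the exact Landauer and Dumais recipe: the `+1` smoothing, the entropy floor, and the function name `lsa_vectors` are assumptions of the sketch.

```python
import numpy as np

def lsa_vectors(counts, k=2):
    """Sketch of LSA preprocessing: log-frequency entries divided by
    word entropy, then truncated SVD to k dimensions.
    `counts` is a word-by-document frequency matrix."""
    counts = np.asarray(counts, dtype=float)
    logfreq = np.log(counts + 1.0)
    # p(doc | word), then word entropy -sum(p log p) over its documents.
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1)
    # Divide log frequencies by entropy, guarding against zero entropy.
    weighted = logfreq / np.maximum(entropy[:, None], 1e-6)
    # Truncated SVD: keep only the k largest singular values.
    U, S, _ = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :k] * S[:k]
```

Cosines between the resulting row vectors then serve as the model's similarity measure; words sharing document contexts end up with similar vectors even at severely reduced rank.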
LSA has been successful at simulating a wide range of psychological and psycholinguistic phenomena, from judgments of semantic similarity (Landauer & Dumais, 1997) to word categorization (Laham, 2000), discourse comprehension (Kintsch, 1998), and judgments of essay quality (Landauer, Laham, Rehder, & Schreiner, 1997). LSA has even earned college entrance-level grades on the TOEFL, and has been shown to acquire vocabulary at a rate that is comparable to standard developmental trends (Landauer & Dumais, 1997).
LSA capitalizes on a word’s contextual co-occurrence, but not how a word is used in that context. Information about the meaning of hammer can be determined by observing the contexts in which it appears. However, the contexts also contain temporal redundancy (grammatical information) about how the word is used relative to other words. Very rarely is a nail ever used to pound a board into a hammer; temporal redundancy reveals information about the word’s order relation to other words in the context. Further, this order information reveals that a hammer may be more similar to a mallet or hatchet in how it is used in context than it is to screwdriver or nail. Even though screwdriver and nail may have more contextual overlap with hammer, they are not used in the same way within those contexts. How a word is used in context can carry as much variance to induce its meaning as what contexts it appears in (and, obviously, these are correlated sources of information).
The Hyperspace Analogue to Language (HAL; Burgess and Lund, 2000, Lund and Burgess, 1996) is related to LSA, but also capitalizes on positional similarities between words across contexts. HAL is trained by moving an n-word window across text and calculating the distance (in word steps) between all words that occur in the window at each point in time. HAL’s co-occurrence matrix is a sparse word-by-word (70,000 × 70,000) matrix in which a word’s row entry records the frequency, inversely weighted by distance (summed word steps), that the word appeared in the window succeeding every other word possible, and a word’s column entry records the frequency (inversely weighted by distance) that the word appeared in the window preceding every other word. After training, the row and column vectors for a word are concatenated to yield the word’s representation. Words that have appeared similar distances around the same words can develop similar patterns of elements in their vectors. Thus, both contextual co-occurrence and positional similarity are represented in HAL.
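HAL's counting scheme can be sketched as follows. The window size and the linear ramp weighting (window − distance + 1, so nearer words count more) are illustrative assumptions standing in for HAL's inverse distance weighting; the function names are hypothetical.

```python
import numpy as np

def hal_matrix(tokens, vocab, window=5):
    """Sketch of HAL's sliding-window counting. M[i, j] records how
    often word j PRECEDED word i within the window, weighted so that
    closer words contribute more (window - distance + 1)."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for pos, word in enumerate(tokens):
        # Look back up to `window` words from the current position.
        for d in range(1, window + 1):
            if pos - d < 0:
                break
            M[idx[word], idx[tokens[pos - d]]] += window - d + 1
    return M

def hal_vector(M, i):
    # A word's representation: its row (preceding context) concatenated
    # with its column (following context).
    return np.concatenate([M[i], M[:, i]])
```

Words that accumulate similar row-plus-column patterns have appeared at similar distances around the same words, which is exactly the positional similarity HAL exploits.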
In HAL, words that appear in similar positions around the same words tend to develop the most similar vector representations. Note that two words need not directly co-occur within the window to develop similar representations. Two words that appeared around the same words will be similar, and this relationship is magnified if they are also found similar distances relative to other words. In HAL, not only do similar nouns (e.g., cat–dog) have similar vector representations, but so do other lexical classes, such as determiners, prepositions, and animate and inanimate nouns (Audet and Burgess, 1998, Burgess and Lund, 2000). HAL can be envisioned as a large-scale approximation of the structure that could be learned by a simple-recurrent network (SRN; Elman, 1990, Elman, 1991, Servan-Schreiber et al., 1991), and has been shown to learn representations that have very similar structure to SRNs when both are trained on small finite-state grammars (Burgess & Lund, 2000). Although HAL does not explicitly encode the order of words, its distance weighting can serve as a proxy for order information (Perfetti, 1998).
LSA and HAL consider subtly different types of information while learning text, and these differences are reflected in the structural representations formed by each model. LSA tends to weight associative relationships more highly than purely semantic relationships. For example, the representation for car is much more similar to the representation for drive (cos = 0.73) than it is to members of the same semantic category, such as truck (cos = 0.49) or boat (cos = 0.03). Further, the verb drive is more similar to car than it is to other action verbs, such as walk (cos = 0.23).
By contrast, HAL considers distance between intervening words in the moving window; hence, semantic relationships can become more highly weighted in HAL than associative relationships. In HAL, car is more similar to truck (d = 0.90) and boat (d = 0.95) than it is to drive (d = 1.12), and the verb drive is more similar to another action verb like walk (d = 1.03) than it is to car (d = 1.12).
HAL and LSA focus on different sources of information and, thus, make different predictions about the strength of semantic and associative relationships in memory. The two types of information are correlated, but each model also learns unique variance not considered by the other. The question is whether the unique sources of variance from each type of learning are both needed to account for the structure of semantic memory. Clearly, humans take advantage of both types of information (e.g., which words hammer is found in context with, and how hammer is used relative to those words), and an ideal model of semantic representation would consider both sources of information when learning text.
Attempts to consider both types of information have traditionally used vectors to represent contextual semantics, and rules or production systems for order information (e.g., Wiemer-Hastings, 2000, Wiemer-Hastings, 2001). Hence, the two types of information are stored separately and in a different form. Another approach, taken by Dennis (2005), has used a Bayesian adaptation of string edit theory to represent both syntagmatic and paradigmatic information within a single model. Similarly, Griffiths, Steyvers, Blei, and Tenenbaum (2005) have successfully combined the two sources of information in a generative framework, using a hidden-Markov model to learn sequential dependencies and a probabilistic topic model to learn semantic relationships. Our goal is to apply mechanisms from associative memory theory to learn a single vector representation for a word, containing a mixture of both contextual and word-order information. In doing so, we wish to demonstrate that information about word order is used in representing a word’s meaning, and that the simple mechanisms used in other types of associative learning are sufficient to capitalize on this structure without postulating mechanisms for encoding order that are specific to language.
In the domain of associative memory, Murdock, 1982, Murdock, 1992, Murdock, 1993 has used convolution as a mechanism to build associations between pairs of vectors representing words or objects. Murdock represents information about items as random vectors, and information about their associations as convolutions of the item vectors. Both item and associative representations are then summed together and stored in a composite distributed memory representation. The composite representation can be used to determine if an item was learned: A novel item vector will have an expected dot product of zero with the composite representation, and a learned item vector will have a much higher dot product (however, the magnitude depends both on dimensionality and on number of items stored). Further, when a learned item vector is correlated with the memory representation (the inverse of convolution), the result is a facsimile of the vector representing the item with which it was associated. If a novel item vector is correlated with the memory representation, the result will not resemble any item known. Murdock’s storage of item and associative information in a composite memory representation affords the possibility to learn both contextual and order information into a composite lexical representation if the same ideas were adapted to learn from language.
Convolution is basically a method of compressing the outer-product matrix of two vectors; the convolution of two vectors produces a third vector that does not resemble either argument vector, but is rather a key storing their association. When one member of the learned pair is later encountered in the environment and compared to the associative key (via correlation), the other member of the learned pair is reconstructed. Such a process is very useful because an object can be retrieved without ever storing it—it is reconstructed from an item in the environment and a stored association. Further, several pairs of associations can be summed together in the same memory vector. Because convolution distributes over addition, a single representation can be used to represent several associative keys. Once again, when one member of a learned pair is correlated with the representation, the other member is reconstructed; if an unknown item is correlated with the representation, however, no known item can be retrieved. Such convolution–correlation memory models are often referred to as holographic models because they are based on the same mathematical principles used in light holography (see Plate, 2003 for a review).
A common problem with aperiodic (linear) convolution is that the associative representation has 2n − 1 dimensions, larger than the vectors representing the items themselves (where n is the dimensionality of the item vectors). For example, the convolution of two item vectors, x and y, is:

$$z_i = \sum_{j} x_j\, y_{i-j}, \quad i = 0, \ldots, 2n-2,$$

where the sum runs over all j for which both indices are in range. Basically, the diagonals of the outer-product matrix are summed, producing a 2n − 1 dimensional association. Thus, vectors representing items and their associations cannot be directly summed together because they have different dimensionality. To finesse the problem, many memory models pad the item vectors with zeros to balance dimensionality (e.g., Murdock, 1982), or simply truncate the association vector by trimming the outside elements to match the dimensionality of the item vectors (e.g., Metcalfe-Eich, 1982).
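The dimensionality mismatch is easy to verify numerically with NumPy's built-in linear convolution (an illustrative check only):

```python
import numpy as np

# Two n-dimensional item vectors (n = 4 here).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, -1.0, 2.0])

# Aperiodic (linear) convolution sums the diagonals of the
# outer-product matrix, yielding a 2n - 1 dimensional result.
z = np.convolve(x, y)
print(z.shape)  # (7,) -- cannot be summed directly with x or y
```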
Although padding and truncation are adequate solutions for models of paired-associate learning, neither is appropriate for application to unconstrained language. Such patches still limit convolution to learning pairwise associations and will miss the higher-order temporal structure that is characteristic of natural languages. To recursively bind together vectors representing all words in sentences (without expanding dimensionality) we employ circular convolution, a technique used extensively in image and signal processing (Gabel & Roberts, 1973; see also Plate, 1995, Plate, 2003 for examples in cognitive modeling). The circular convolution of k n-dimensional vectors always produces an n-dimensional association vector, without wasting information by truncating or expanding dimensionality by padding; for two vectors, x and y,

$$z_i = (\mathbf{x} \circledast \mathbf{y})_i = \sum_{j=0}^{n-1} x_{j \bmod n}\; y_{(i-j) \bmod n}, \quad i = 0, \ldots, n-1. \tag{1}$$

Circular convolution is also referred to as cyclic or wrapped convolution because, rather than summing linearly down each diagonal of the outer-product matrix, the summation wraps around the diagonals in modulo-n steps. Hence, all elements in the matrix are summed, but dimensionality remains constant.
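Circular convolution, and its approximate inverse, circular correlation, can be computed efficiently in the frequency domain. A minimal sketch (the function names are ours; the FFT identity is standard signal processing):

```python
import numpy as np

def cconv(x, y):
    """Circular (wrapped) convolution of two n-dimensional vectors;
    the result is also n-dimensional. Equivalent to summing the
    outer-product diagonals in modulo-n steps."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def ccorr(x, y):
    """Circular correlation, the approximate inverse of cconv: given
    one member of a bound pair, reconstructs a noisy copy of the other."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y)))
```

With random vectors whose elements have variance 1/n, decoding a bound pair with `ccorr` recovers a recognizable (though noisy) facsimile of the associated item, while probing with an unrelated vector retrieves nothing resembling a known item, which is exactly the retrieval behavior described above.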
We will use the circular convolution algorithm to learn associations between words in a memory model we call BEAGLE (Bound Encoding of the Aggregate Language Environment). BEAGLE constructs distributed representations for words from experience with a large-scale text corpus (text will be read and processed in one-sentence increments). The resulting representation will contain roughly the types of information inherent in both HAL and LSA, stored in a composite holographic lexicon. For example, BEAGLE will learn the types of words that share contextual information with hammer, and the types of words that share associative information (position relative to other words) with hammer.
The first time a word is encountered when reading the text corpus it is assigned a random environmental vector, ei, which represents its physical characteristics (e.g., orthography or phonology). At this point, we are agnostic about the actual environmental structure; hence, we assume no structural similarities between words, and represent each with a different random representation. Environmental vector elements are sampled at random from a Gaussian distribution with μ = 0 and σ² = 1/D, where D is the vector dimensionality. Each time a word is encountered while reading the text corpus, the same environmental vector is used to represent it.
A word’s memory representation, mi, however, is updated in each sentence in which the word occurs by adding the sentence context to it. A word’s context in a sentence, c, is simply the sum of the environmental representations of the other words in the sentence:

$$\mathbf{c}_i = \sum_{j \ne i} \mathbf{e}_j,$$

where j ranges over the other words in the sentence. This new context is then added to the word’s memory representation:

$$\mathbf{m}_i = \mathbf{m}_i + \mathbf{c}_i.$$

A word’s memory representation thus develops a pattern of elements that reflects its history of co-occurrence with other words in sentences. In addition, latent similarity can form in the lexicon between words that have never directly co-occurred in a sentence but, nonetheless, have occurred in similar contexts (around the same words) during learning. This is analogous to a latent relationship in LSA, but the relationship simply emerges from context accumulation (summing of similar random vectors) rather than SVD.
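The context update is a one-pass accumulation. A minimal sketch, assuming environmental vectors stored as rows of a matrix and a vocabulary index (both names are illustrative):

```python
import numpy as np

def update_context(memory, env, sentence, vocab_index):
    """Sketch of the context step: add to each word's memory vector the
    sum of the environmental vectors of the OTHER words in the sentence.
    `memory` and `env` are (vocab_size, D) arrays updated/read in place."""
    ids = [vocab_index[w] for w in sentence]
    total = env[ids].sum(axis=0)
    for i in ids:
        memory[i] += total - env[i]  # everyone-but-me sum
```

Computing the full sentence sum once and subtracting each word's own vector avoids re-summing the sentence for every word, which matters when processing a large corpus sentence by sentence.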
At the same time as context information is being learned for the sentence being processed, so is order information, that is, information about the word’s position relative to other words in the sentence. A word’s order information is formed by binding it, with directional circular convolution, to all n-gram chunks in the sentence that include it. The position of the word being coded is represented by a constant random placeholder vector, Φ (sampled from the same element distribution as the environmental vector elements). Each n-gram association is unique. For example, e1 ⊛ e2 produces a different vector from e1 ⊛ e2 ⊛ e3 (i.e., the association for a trigram differs from that for a bigram, even if the trigram contains the bigram), but both operations produce fixed-dimensional vectors, so they can be directly compared and stored.
Because circular convolution is used, all n-gram associations are represented in the same fixed dimensionality and, hence, they can all be summed into a single order vector that represents the word’s position relative to all other words in the sentence. The order information, o, for a word in a sentence is thus the sum of its binding chunks:

$$\mathbf{o}_i = \sum_{j} \mathbf{b}_{i,j},$$

where bi,j (b for “binding”) is the jth convolution chunk for the word being coded, and the number of chunks depends on the word’s position, p, in the sentence.
For example, consider coding the memory representation for excellent in the simple sentence “dingoes make excellent pets.” The memory representation for excellent, mexcellent, is updated by adding the word’s context and order information from the new sentence, coded from the environmental representations of the other words:

$$\mathbf{c}_{excellent} = \mathbf{e}_{dingoes} + \mathbf{e}_{make} + \mathbf{e}_{pets},$$
$$\mathbf{o}_{excellent} = (\mathbf{e}_{make} \circledast \Phi) + (\Phi \circledast \mathbf{e}_{pets}) + (\mathbf{e}_{dingoes} \circledast \mathbf{e}_{make} \circledast \Phi) + (\mathbf{e}_{make} \circledast \Phi \circledast \mathbf{e}_{pets}) + (\mathbf{e}_{dingoes} \circledast \mathbf{e}_{make} \circledast \Phi \circledast \mathbf{e}_{pets}),$$
$$\mathbf{m}_{excellent} = \mathbf{m}_{excellent} + \mathbf{c}_{excellent} + \mathbf{o}_{excellent}.$$

The memory representation for a word, mi, thus becomes a pattern of elements that reflects the word’s history of co-occurrence with, and position relative to, other words in sentences. The context information alone is an approximation to the kind of semantic structure that LSA learns, and the order information alone is similar to the type of structure learned by HAL or an SRN. BEAGLE’s learning algorithms, however, allow it to learn both types of information into a single composite representation.
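The order step, enumerating every n-gram that contains the target word and binding it with the placeholder in the target's slot, can be sketched as follows. The cap on n-gram size (`max_n`) and the strict left-to-right binding order are assumptions of this sketch, not claims about BEAGLE's exact directional scheme.

```python
import numpy as np

def cconv(x, y):
    # Circular convolution via FFT.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def order_vector(sentence, target, env, phi, max_n=4):
    """Sketch of the order step: sum the circular convolutions of all
    n-grams (sizes 2..max_n) containing the word at index `target`,
    with the placeholder `phi` substituted in the target's slot.
    `env` maps words to their environmental vectors."""
    o = np.zeros(phi.shape[0])
    L = len(sentence)
    for n in range(2, min(max_n, L) + 1):
        for start in range(L - n + 1):
            if not (start <= target < start + n):
                continue  # the n-gram must include the target word
            acc = None
            for pos in range(start, start + n):
                v = phi if pos == target else env[sentence[pos]]
                acc = v if acc is None else cconv(acc, v)
            o += acc
    return o
```

Because every chunk, bigram, trigram, or larger, comes out at the same dimensionality, the running sum `o` can be added straight into the word's memory vector alongside the context sum.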
Table 1 demonstrates the structure learned by the context and order equations separately when BEAGLE is trained on a text corpus. For each target word (capitalized), the eight nearest neighbors in each space are displayed (i.e., the eight words that have developed the most similar memory representations to the target). When comparing words learned from context information only, for example, bird is most similar to associated words, such as wings, beak, and nest. In the context-only lexicon, verbs are similar to the nouns they operate upon. For example, food is related to eat, car is related to drive, and book is related to read, but eat, drive, and read are not highly related to one another, nor are food, car, or book. By contrast, when comparing words learned from order information only, bird is most similar to other animals. In the order-only lexicon, words that appear in similar positions relative to other words in sentences develop similar structure from the accumulation of common associations during learning. Drive, eat, and read are all similar to one another, and cluster distinctly from the nouns (car, food, and book now being similar to one another).
The representations learned by BEAGLE are basically a blend of these two types of structure. The model contains information learned by both LSA and HAL, from very simple summation and association mechanisms and without the need for dimensional optimization. Unlike HAL, BEAGLE explicitly encodes order relations, rather than tabulating distances. Both types of information are stored together as a single composite vector pattern. Jones and Mewhort (in press) have demonstrated that BEAGLE’s composite representations more closely predict semantic relatedness in Miller’s (1995) WordNet than does LSA. In addition, the composite representation is as good a predictor of WordNet measures as both of its component representations taken together. Hence, compressing context and order information into a single composite representation does not seem to interfere with either type of information.
Further, order sequences that have been learned can be retrieved from the lexicon using inverse convolution (much in the same way Murdock, 1982, Murdock, 1992 retrieves items given a cue), allowing the model to perform a variety of word-transition tasks without the need for built-in transition rules. The present paper, however, examines only the structure of the learned lexical representations and does not require decoding of word transitions. For more information on the decoding equations in BEAGLE and predicting word transitions in sentences, see Jones and Mewhort (in press).
Comparing model structure to data structure
In this section, we compare the similarity structure of representations learned by HAL, LSA, and BEAGLE to response latency data from human subjects in a range of semantic priming experiments. Of particular interest are experiments examining “purely” semantic overlap between primes and targets, associative-only prime-target relationships, and mediated prime-target relationships.
For the simulations reported in this paper, all three models were trained on the same text corpus, compiled by
General discussion
Semantic space models are particularly appealing because they learn representations for words automatically from statistical characteristics of language. The approach solves the “hand coding” problem inherent in models of semantic representation such as feature lists and semantic networks,8 and leaves much of the representation complexity in
References (71)
- Semantic and associative priming in the cerebral hemispheres: Some words do, some words don’t…sometimes, some places. Brain & Language (1990).
- Mediated priming in the lexical decision task: Evidence from event-related potentials and reaction time. Journal of Memory and Language (2000).
- An attractor model of lexical conceptual processing: Simulating semantic priming. Cognitive Science (1999).
- The range of automatic spreading activation in word priming. Journal of Verbal Learning and Verbal Behavior (1983).
- Finding structure in time. Cognitive Science (1990).
- A model of grounded language acquisition: Sensorimotor features improve lexical and grammatical learning. Journal of Memory and Language (2005).
- Depth of spreading activation revisited: Semantic mediated priming occurs in lexical decisions. Journal of Memory and Language (1988).
- Derivations for the chunking model. Journal of Mathematical Psychology (1993).
- Judgments of frequency and recency in a distributed memory model. Journal of Mathematical Psychology (2001).
- Representing properties locally. Cognitive Psychology (2001).
- Distinguishing between manner of motion and inherently directed motion verbs using a high-dimensional memory space and semantic judgments. Proceedings of the Annual Meeting of the Cognitive Science Society.
- Depth of automatic spreading activation: Mediated priming effects in pronunciation but not in lexical decision. Journal of Experimental Psychology: Learning, Memory, and Cognition.
- Category norms for verbal items in 56 categories: A replication and extension of the Connecticut category norms. Journal of Experimental Psychology Monograph.
- Three-step priming in lexical decision. Memory & Cognition.
- A spreading-activation theory of semantic processing. Psychological Review.
- How to make a language user.
- Indexing by latent semantic analysis. Journal of the American Society for Information Science.
- A memory-based theory of verbal cognition. Cognitive Science.
- Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning.
- Semantic and associative priming in the mental lexicon.
- An analysis of immediate memory: The free-recall task.
- Signals and linear systems.
- Integrating topics and syntax. Advances in Neural Information Processing Systems.
- Is semantic priming due to association strength or feature overlap? A microanalytic review. Psychonomic Bulletin & Review.
- A symbolic-connectionist theory of relational inference and generalization. Psychological Review.
- Comprehension: A paradigm for cognition.
- A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review.
- An introduction to latent semantic analysis. Discourse Processes.
- How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans.
- Mediated priming in high-dimensional semantic space: No effect of direct semantic relationships or co-occurrence. Brain and Cognition.
- Semantic priming without association: A meta-analytic review. Psychonomic Bulletin & Review.
- ☆
This research was supported by grants from NSERC and Sun Microsystems to DM, and an IERI grant to WK. MJ was supported by a postdoctoral fellowship from NSERC. We would like to thank Mark Steyvers and Jim Neely for comments on an earlier version of this manuscript.