ArticlePDF AvailableLiterature Review

Semantic Memory: A Review of Methods, Models, and Current Challenges

August 2020
Psychonomic Bulletin & Review 28(6794)

August 2020
28(6794)

Authors:

Bowdoin College

Adult semantic memory has been traditionally conceptualized as a relatively static memory system that consists of knowledge about the world, concepts, and symbols. Considerable work in the past few decades has challenged this static view of semantic memory, and instead proposed a more fluid and flexible system that is sensitive to context, task demands, and perceptual and sensorimotor information from the environment. This paper (1) reviews traditional and modern computational models of semantic memory, within the umbrella of network (free association-based), feature (property generation norms-based), and distributional semantic (natural language corpora-based) models, (2) discusses the contribution of these models to important debates in the literature regarding knowledge representation (localist vs. distributed representations) and learning (error-free/Hebbian learning vs. error-driven/predictive learning), and (3) evaluates how modern computational models (neural network, retrieval-based, and topic models) are revisiting the traditional "static" conceptualization of semantic memory and tackling important challenges in semantic modeling such as addressing temporal, contextual, and attentional influences , as well as incorporating grounding and compositionality into semantic representations. The review also identifies new challenges regarding the abundance and availability of data, the generalization of semantic models to other languages, and the role of social interaction and collaboration in language learning and development. The concluding section advocates the need for integrating representational accounts of semantic memory with process-based accounts of cognitive behavior, as well as the need for explicit comparisons of computational models to human baselines in semantic tasks to adequately assess their psychological plausibility as models of human semantic memory.

Original network proposed by Collins and Quillian (1969), reprinted from Balota &Coane (2008)

…

Large-scale visualization of a directed semantic network created by Steyvers and Tenenbaum (2005) and shortest path between RELEASE to ANCHOR, adapted from Kumar, Balota, & Steyvers (2019).

…

The high-dimensional space produced by HAL from co-occurrence word vectors, adapted from Lund & Burgess (1996).

…

A depiction of the skip-gram version of the word2vec model architecture. The model is creating a vector representation for the word lived by predicting its surrounding words in the sentence "Jane's mother lived in Paris". The weights of the hidden layer represent the vector representation for the word lived, as the model performs the prediction task and adjusts the weights based on the prediction error. Adapted from Günther et al. (2019).

…

A depiction of a typical convolutional neural network that detects vertical edges in an image. A sliding filter is multiplied with the pixelized image to produce a matrix, and then a pooling step combines results from the convolved output into a smaller matrix by selecting the maximum value from each 2x2 sub-matrix in the convolved matrix. This final 2x2 matrix represents the final representation of the image highlighting the vertical edges.

…

Figures - uploaded by Abhilasha Kumar

Content may be subject to copyright.

Content uploaded by Abhilasha Kumar

Content may be subject to copyright.

Semantic Memory: A Review of Methods, Models, and Current Challenges

Abhilasha A. Kumar

Washington Un iv er si ty i n St L ou is , MO

Adult semantic memory has been traditionally conceptualized as a relatively static memory system

that consists of knowledge about the world, concepts, and symbols. Considerable work in the past few

decades has challenged this static view of semantic memory, and instead proposed a more fluid and flex-

ible system that is sensitive to context, task demands, and perceptual and sensorimotor information from

the environment. This paper (1) reviews traditional and modern computational models of semantic

memory, within the umbrella of network (free association-based), feature (property generation norms-

based), and distributional semantic (natural language corpora-based) models, (2) discusses the contribu-

tion of these models to important debates in the literature regarding knowledge representation (localist

vs. distributed representations) and learning (error-free/Hebbian learning vs. error-driven/predictive

learning), and (3) evaluates how modern computational models (neural network, retrieval-based, and topic

models) are revisiting the traditional “static” conceptualization of semantic memory and tackling im-

portant challenges in semantic modeling such as addressing temporal, contextual, and attentional influ-

ences, as well as incorporating grounding and compositionality into semantic representations. The review

also identifies new challenges regarding the abundance and availability of data, the generalization of se-

mantic models to other languages, and the role of social interaction and collaboration in language learning

and development. The concluding section advocates the need for integrating representational accounts of

semantic memory with process-based accounts of cognitive behavior, as well as the need for explicit

comparisons of computational models to human baselines in semantic tasks to adequately assess their

psychological plausibility as models of human semantic memory.

Keywords: distributional semantic models; semantic memory; neural networks; semantic networks

Note: This is a post-peer-review, pre-copyedit version of an article published in Psychonomic

Bulletin and Review. The final authenticated version will be made available online at:

https://doi.org/10.3758/s13423-020-01792-x.

1 Introduction

What does it mean to know what an ostrich is? The

question of how meaning is represented and organized

by the human brain has been at the forefront of explo-

rations in philosophy, psychology, linguistics, and

computer science for centuries. Does knowing the

meaning of an ostrich involve having a prototypical

representation of an ostrich that has been created by

averaging over multiple exposures to individual os-

triches? Or does it instead involve extracting particular

features that are characteristic of an ostrich (e.g., it is

big, it is a bird, it does not fly, etc.), that are acquired

via experience, and stored and activated upon encoun-

tering an ostrich? Further, is this knowledge stored

through abstract and arbitrary symbols such as words,

or is it grounded in sensorimotor interactions with the

physical environment?

Acknowledgements: I sincerely thank David A. Balota,

Jeffrey M. Zacks, Michael N. Jones, and Ian G. Dobbins

for their extremely insightful feedback and helpful

comments on earlier versions of this manuscript.

Correspondence concerning this article should be

addressed to Abhilasha A. Kumar, Department of

Psychological and Brain Sciences, Washington

University in St. Louis, Campus Box 1125, One

Brookings Drive, St. Louis, MO 63130. E-mail:

abhilasha.kumar@wustl.edu

The computation of meaning is fundamental to all cog-

nition, and hence it is not surprising that considerable

work has attempted to uncover the mechanisms that

contribute to the construction of meaning from experi-

ence.

There have been several important historical seeds

that have laid the groundwork for conceptualizing how

meaning is learned and represented. One of the earliest

attempts to study how meaning is represented was by

Osgood (1952; also see Osgood, Suci, & Tannenbaum,

1957) through the use of the semantic differential tech-

nique. Osgood collected participants’ ratings of con-

cepts (e.g., peace) on several polar scales (e.g., hot-

cold, good-bad, etc.), and using multidimensional scal-

ing, showed that these ratings aligned themselves along

three universal dimensions: evaluative (good-bad), po-

tency (strong-weak) and activity (active-passive). Os-

good’s work was important in the following two ways:

(a) it introduced an empirical tool to study the nature of

semantic representations; (b) it provided early evidence

that the meaning of a concept or word may actually be

distributed across several dimensions, in contrast to be-

ing represented through a localist representation, i.e.,

through a single dimension, feature, or node. As subse-

quently discussed, this contrast between localist and

distributed meaning representations has led to different

modeling approaches to understanding how meaning is

learned and represented.

Another important milestone in the study of mean-

ing was the formalization of the distributional hypoth-

esis (Harris, 1970), best captured by the phrase “you

shall know a word by the company it keeps” (Firth,

1957), which dates back to Wittgenstein’s early intui-

tions (1953) about meaning representation. The idea

behind the distributional hypothesis is that meaning is

learned by inferring how words co-occur in natural lan-

guage. For example, ostrich and egg may become re-

lated because they frequently co-occur in natural lan-

guage, whereas ostrich and emu may become related

because they co-occur with similar words. This distri-

butional principle has laid the groundwork for several

decades of work in modeling the explicit nature of

meaning representation. Importantly, despite the fact

that several distributional models in the literature do

make use of distributed representations, it is their learn-

ing process of extracting statistical redundancies from

natural language that makes them distributional in na-

ture.

Another historically significant event in the study of

meaning was Tulving’s (1972) classic distinction be-

tween episodic and semantic memory. Tulving pro-

posed two subdivisions of declarative memory: epi-

sodic memory, consisting of memories of experiences

linked to specific times and places (e.g., seeing an os-

trich at the zoo last month), and semantic memory, stor-

ing general knowledge about the world and what verbal

symbols (i.e., words) mean in an amodal (i.e., not

linked to any specific modality) memory store (e.g.,

storing what an ostrich is, what it looks like etc.

through words). Although there are long-standing de-

bates regarding the strong distinction between semantic

and episodic memory (e.g., McKoon, Ratcliff, & Dell,

1986), this dissociation was supported by early neuro-

psychological studies of amnestic patients who were

able to acquire new semantic knowledge without hav-

ing any concrete memory for having learned this infor-

mation (Gabrieli, Cohen, & Corkin, 1988; O’Kane,

Kensinger, & Corkin, 2004). Indeed, the relative inde-

pendence of these two types of memory systems has

guided research efforts for many years, as is evidenced

by early work on computational models of semantic

memory. As described below, this perspective is begin-

ning to change with the onset of more recent modeling

perspectives.

These theoretical seeds have driven three distinct

approaches to modeling the structure and organization

of semantic memory: associative network models, dis-

tributional models, and feature-based models. Associa-

tive network models are models that represent words as

individual nodes in a large memory network, such that

words that are related in meaning are connected to each

other through edges in the network (e.g., Collins &

Quillian, 1969; Collins & Loftus, 1975). On the other

hand, inspired the distributional hypothesis, Distribu-

tional Semantic Models (DSMs) collectively refer to a

class of models where the meaning of a word is learned

by extracting statistical redundancies and co-occur-

rence patterns from natural language. Importantly,

DSMs provide explicit mechanisms for how words or

features for a concept may be learned from the natural

environment. Finally, feature models assume that

words are represented in memory as a distributed col-

lection of binary features (e.g., birds have wings,

whereas cars do not), and the correlation or overlap of

these features determines the extent to which words

have similar meanings (Smith, Shoben, & Rips, 1974;

Tversky, 1977). Overall, the network-based, feature-

based, and distributional approaches to semantic mod-

eling have sparked important debates in the literature

and informed our understanding of the different facets

involved in the construction of meaning. Therefore,

this review attempts to highlight important milestones

in the study of semantic memory, identify challenges

currently facing the field, and integrate traditional ideas

with modern approaches to modeling semantic

memory.

In a recent article, Günther, Rinaldi, and Marelli

(2019) reviewed several common misconceptions

about distributional semantic models and evaluated the

cognitive plausibility of modern DSMs. Although the

current review is somewhat similar in scope to Günther

et al.’s work, the current paper has different aims. Spe-

cifically, this review is a comprehensive analysis of

models of semantic memory across multiple fields and

tasks and so is not focused only on DSMs. It ties to-

gether classic models in psychology (e.g., associative

network models, standard DSMs, etc.) with current

state-of-the-art models in machine learning (e.g., trans-

former neural networks, convolutional neural net-

works, etc.) to elucidate the potential psychological

mechanisms that these fields posit to underlie semantic

retrieval processes. Further, the present work reviews

the literatures on modern multimodal semantic models,

compositional semantics, and newer retrieval-based

models, and therefore assesses these newer models and

applications from a psychological perspective. There-

fore, while Günther et al.’s review serves the role of

clarifying how DSMs may indeed represent a cogni-

tively plausible account of how meaning is learned, the

present review serves the role of presenting a more

comprehensive assessment and synthesis of multiple

classes of models, theories, and learning mechanisms,

as well as drawing closer ties between the rapidly pro-

gressing machine learning literature and the constraints

imposed by psychological research on semantic

memory—two fields that have so far been only loosely

connected to each other. Therefore, the goal of the pre-

sent review is to survey the current state of the field by

tying together work from psychology, computational

linguistics, and computer science, and also identify

new challenges to guide future empirical research in

modeling semantic memory.

1.1 Overview

This review emphasizes five important areas of re-

search in semantic memory. The first section presents a

modern perspective on the classic issues of semantic

memory representation and learning. Associative,

feature-based, and distributional semantic models are

introduced and discussed within the context of how

these models speak to important debates that have

emerged in the literature regarding semantic vs. associ-

ative relationships, prediction, and co-occurrence. In

particular, a distinction is drawn between distributional

models that propose error-free vs. error-driven learning

mechanisms for constructing meaning representations,

and the extent to which these models explain perfor-

mance in empirical tasks. Overall, although empirical

tasks have partly informed computational models of se-

mantic memory, the empirical and computational ap-

proaches to studying semantic memory have developed

somewhat independently. Therefore, the first section

attempts to bridge this gap by integrating empirical

findings from lexical decision, pronunciation, and cat-

egorization tasks, with modeling approaches such as

large-scale associative semantic networks (e.g., De

Deyne, Navarro, Perfors, Brysbaert, & Storms, 2019;

Steyvers & Tenenbaum, 2005), error-free learning

based DSMs (e.g., Jones & Mewhort, 2007; Landauer

& Dumais, 1997) as well as error-driven learning-

based models (e.g., Mikolov, Chen, Corrado, & Dean,

2013a).

The second section presents an overview of psycho-

logical research in favor of conceptualizing semantic

memory as part of a broader integrated memory system

(Jamieson et al., 2018; Kwantes, 2005; Yee, McRae &

Jones, 2018). The idea of semantic memory represen-

tations being context-dependent is discussed, based on

findings from episodic memory tasks, sentence pro-

cessing, and eye-tracking studies (e.g., Yee & Thomp-

son-Schill, 2016). These empirical findings are then in-

tegrated with modern approaches to modeling semantic

memory as a dynamic system that is sensitive to con-

textual dependencies, and can account for polysemy

and attentional influences through topic models (Grif-

fiths, Steyvers, & Tenenbaum, 2007), recurrent neural

networks (Elman, 1991; Peters et al., 2018), and atten-

tion-based neural networks (Devlin, Chang, Lee, &

Tou ta nov a, 20 19; Vas wa ni et a l. , 2 019). The rem ai nde r

of the section discusses the psychological plausibility

of a relatively new class of models (retrieval-based

models, e.g., Jamieson et al., 2018) that question the

need for “learning” meaning at all, and instead propose

that semantic representations are merely a product of

retrieval-based operations in response to a cue, there-

fore blurring the traditional distinction between seman-

tic and episodic memory (Tulving, 1972).

The third section discusses the issue of grounding,

and how sensorimotor input and environmental inter-

actions contribute to the construction of meaning. First,

empirical findings from sensorimotor priming and

cross-modal priming studies are discussed, that chal-

lenge the static, amodal, lexical nature of semantic

memory that has been the focus of the majority of com-

putational semantic models. There is now accumulat-

ing evidence that meaning cannot be represented exclu-

sively through abstract, amodal symbols such as words

(Barsalou, 2016). Therefore, important critiques of

amodal computational models are clarified in the extent

to which these models represent psychologically plau-

sible models of semantic memory that include percep-

tual motor systems. Next, state-of-the-art computa-

tional models such as convolutional neural networks

(Collobert et al., 2011), feature-integrated DSMs (An-

drews, Vigliocco, & Vinson, 2009; Howell, Jankowicz,

& Becker, 2005; Jones & Recchia, 2010), and multi-

modal DSMs (Bruni, Tran, & Baroni, 2014; Lazaridou,

Pham, & Baroni, 2015) are discussed within the con-

text of how these models are incorporating non-linguis-

tic information in the learning process and tackling the

grounding problem.

The fourth section focuses on the issue of composi-

tionality, i.e., how words can be effectively combined

and scaled up to represent higher-order linguistic struc-

tures like sentences, paragraphs, or even episodic

events. In particular, some early approaches to model-

ing compositional structures like vector addition (Lan-

dauer & Dumais, 1997), frequent phrase extraction

(Mikolov et al., 2013b), and finding linguistic patterns

in sentences (Turney & Pantel, 2010) are discussed.

The rest of the section focuses on modern approaches

to representing higher-order structures through hierar-

chical tree-based neural networks (Socher et al., 2013)

and modern recurrent neural networks (Elman &

McRae, 2019; Franklin, Norman, Ranganath, Zacks, &

Gershman, 2019).

The fifth and final section focuses on some open is-

sues in semantic modeling, such as proposing models

that can be applied to other languages, issues related to

data abundance and availability, understanding the so-

cial and evolutionary roles of language, and finding

mechanistic process-based accounts of model perfor-

mance. These issues shed light on important next steps

in the study of semantic memory and will be critical in

advancing our understanding of how meaning is con-

structed and guides cognitive behavior.

1.2 Many Tasks, Many Models

Before delving into the details of each of the sections,

it is important to emphasize here that models of seman-

tic memory are inextricably tied to the behaviors and

tasks that they seek to explain. For example, associa-

tive network models and early feature-based models

explained response latencies in sentence verification

tasks (e.g., deciding whether “a canary is a bird” is true

or false). Similarly, early semantic models accounted

for higher-order semantic relationships that emerge out

of similarity judgments (e.g., Osgood et al., 1957), alt-

hough several of these models have since been applied

to other tasks. Indeed, the study of meaning has

spanned a variety of tasks, models, and phenomena, in-

cluding but not limited to semantic priming effects in

lexical decision tasks (Balota & Lorch, 1986), false

memory paradigms (Deese, 1959; Roediger & McDer-

mott, 1995), sentence verification (Smith, Shoben, &

Rips, 1974), sentence comprehension (Duffy, Morris,

& Rayner, 1988; Rayner & Frazier, 1989), and argu-

ment reasoning (Niven and Kao, 2019) tasks.

Importantly, the cognitive processes underlying the

sentence verification task may vastly differ from those

underlying similarity judgments, which in turn may

also differ from the processes underlying other com-

plex tasks like reading comprehension and argument

reasoning, and it is unclear whether and how a model

of semantic memory that can successfully explain be-

havior in one task would be able to explain behavior in

an entirely different task.

Of course, the ultimate goal of the semantic model-

ing enterprise is to propose one model of semantic

memory that can be flexibly applied to a variety of se-

mantic tasks, in an attempt to mirror the flexible and

complex ways in which humans use knowledge and

language (see for example, Balota & Yap, 2006). How-

ever, it is important to underscore the need to separate

representational accounts from process-based accounts

in the field. Modern approaches to modeling the repre-

sentational nature of semantic memory have come very

far in describing the continuum in which meaning ex-

ists, i.e., from the lowest-level input in the form of sen-

sory and perceptual information, to words that form the

building blocks of language, to high-level structures

like schemas and events. However, process models op-

erating on these underlying semantic representations

have not received the same kind of attention and have

developed somewhat independently from the represen-

tation modeling movement. For example, although pro-

cess models like the drift-diffusion model (Ratcliff &

McKoon, 2008), the optimal foraging model (Hills,

2006), and the temporal context model (Howard & Ka-

hana, 2002) have been applied to some semantic tasks

like verbal fluency (Hills, Jones, & Todd, 2015), free

association (Howard, Shankar, & Jagadisan, 2011), and

semantic judgments (e.g., Pirrone, Marshall, & Staf-

ford, 2017), their application to different tasks remains

limited and most research has instead focused on rep-

resentational issues. Ultimately, combining process-

based accounts with representational accounts is going

to be critical in addressing some of the current chal-

lenges in the field, an issue that is emphasized in final

section of this review.

2 Semantic Memory Representation and

Learning

How individuals represent knowledge of concepts is

one of the most important questions in semantic

memory research and cognitive science. Therefore, sig-

nificant research on human semantic memory has fo-

cused on issues related to memory representation and

given rise to three distinct classes of models: associa-

tive network models, feature-based models, and distri-

butional semantic models. This section will present a

broad overview of these models, and also discuss some

important debates regarding memory representation

that these models have sparked in the field. Another re-

lated fundamental question in semantic memory re-

search is regarding the learning of concepts, and the re-

mainder of this section will focus on semantic models

that subscribe to two broad psychological mechanisms

(error-free and error-driven learning) that have been

posited to underlie the learning of meaning representa-

tions.

2.1 Semantic Memory Representation

2.1.1 Network-based Approaches

Network-based approaches to semantic memory have a

long and rich tradition rooted in psychology and

computer science. Collins and Quillian (1969)

investigated how individuals navigate through

semantic memory to verify the truth of sentences (e.g.,

the time taken to verify that a shark <has fins>), and

found that retrieval times were most consistent with a

hierarchically organized memory network (see Figure

1), where nodes represented words, and links or edges

represented semantic propositions about the words

(e.g., fish was connected to animal by an “is a” link,

and fish also had its own attributes such as <has fins>

and <can swim>). The mechanistic account of these

findings was through a spreading activation framework

(Quillian, 1967; Quillian, 1969), according to which

individual nodes in the network are activated, which in

turn leads to the activation of neighboring nodes, and

the network is traversed until the desired node or

proposition is reached and a response is made.

Interestingly, the number of steps taken to traverse the

path in the proposed memory network predicted the

time taken to verify a sentence in the original Collins

and Quillian (1969) model. However, the original

model could not explain typicality effects (e.g., why

individuals respond faster to “robin <is a> bird”

compared to “ostrich <is a> bird”) and also

encountered difficulties in explaining differences in

latencies for “false” sentences (e.g., why individuals

are slower to reject “butterfly <is a> bird” compared to

“dolphin <is a> bird”). Collins and Loftus (1975) later

proposed a revised network model where links between

words reflected the strength of the relationship, thereby

eliminating the hierarchical structure from the original

model to better account for behavioral patterns. This

network/spreading activation framework was

extensively applied to more general theories of

language, memory, and problem solving (e.g.,

Anderson, 2000).

Figure 1.Original network proposed by Collins and Quillian

(1969), reprinted from Balota &Coane (2008)

Computational network-based models of semantic

memory have gained significant traction in the past

decade, due to the recent popularity of graph theoretical

and network-science approaches to modeling cognitive

processes (for a review, see Siew, Wulff, Beckage, &

Kenett, 2018). Modern network-based approaches use

large-scale databases to construct networks and capture

large-scale relationships between nodes within the

network. This approach has been used to empirically

study the World Wide Web (Albert, Jeong, & Barabási,

2000; Barabási & Albert, 1999), biological systems

(Watts & Strogatz, 1998), language (Steyvers &

Ten en bau m, 2005; Vit evit ch , Chan & G ol dst ei n, 201 4) ,

and personality and psychological disorders (for

reviews, see Fried et al., 2017). Within the study of

semantic memory, Steyvers and Tenenbaum (2005)

pioneered this approach by constructing three different

semantic networks using large-scale free association

norms (Nelson et al., 2004), Roget’s Thesaurus (Roget,

1911), and WordNet (Fellbaum, 1998; Miller, 1995).

They presented several analyses to indicate that

semantic networks possessed a “small-world

structure”, as indexed by high clustering coefficients (a

parameter that estimates the likelihood that neighbors

of two nodes will be neighbors themselves) and short

average path lengths (a parameter that estimates the

average number of steps from one node in the network

to another), similar to several naturally occurring

networks. Importantly, network metrics such as path

length and clustering coefficients provide a

quantitative way of estimating the large-scale

organizational structure of semantic memory and the

strength of relationships between words in a network

(see Figure 2), and have also proven to be successful in

explaining performance across a wide variety of tasks,

such as relatedness judgments (De Deyne & Storms,

2008; Kenett, Levi, Anaki, & Faust, 2017; Kumar,

Balota, & Steyvers, 2019), verbal fluency (Abbott,

Austerweil, & Griffiths, 2015; Zemla & Austerweil,

2018), and naming (Steyvers & Tenenbaum, 2005).

Other work in this area has explored the influence of

semantic network metrics on psychological disorders

(Kenett, Gold, & Faust, 2016), creativity (Kenett,

Anaki, & Faust, 2014), and personality (Beaty et al.,

2016).

Figure 2. Large-scale visualization of a directed semantic

network created by Steyvers and Tenenbaum (2005) and

shortest path between RELEASE to ANCHOR, adapted

from Kumar, Balota, & Steyvers (2019).

Despite the success of modern semantic networks at

predicting cognitive performance, there is some

skepticism in the field regarding the use of free

association norms to create network representations

(Jones, Hills, & Todd, 2015; Siew et al., 2018).

Specifically, it is not clear whether networks

constructed from association norms are indeed a

representational account of semantic memory or

simply reflect the product of a retrieval-based process

on an underlying representation of semantic memory.

For example, producing the response ostrich to the

word emu represents a retrieval-based process cued by

the word emu, and may not necessarily reflect how the

underlying representation of the words came to be

closely associated in the first place. Therefore, it

appears that associative network models lack an

explicit mechanism through which concepts were

learned in the first place.

A r ecent examp le of thi s fundamenta l debate

regarding the origin of the representation comes from

research on the semantic fluency task, where

participants are presented with a natural category label

(e.g., “animals”) and are required to generate as many

exemplars from that category (e.g., lion, tiger,

elephant…) as possible within a fixed time period.

Hills, Jones, and Todd (2012) proposed that the

temporal pattern of responses produced in the fluency

task mimics optimal foraging techniques found among

animals in natural environments. They provided a

computational account of this search process based on

the BEAGLE model (Jones & Mewhort, 2007).

However, Abbott, Austerweil, and Griffiths (2015)

contended that the behavioral patterns observed in the

task could also be explained by a more parsimonious

random walk on a network representation of semantic

memory created from free association norms. This led

to a series of rebuttals from both camps (Jones, Hills,

& Todd, 2015; Nematzadeh, Miscevic, & Stevenson,

2016), and continues to remain an open debate in the

field (Avery & Jones, 2018). However, Jones, Hills,

and Todd (2015) argued that while free association

norms are a useful proxy for memory representation,

they remain an outcome variable from a search process

on a representation and cannot be a pure measure of

how semantic memory is organized. Indeed, Avery and

Jones (2018) showed that when the input to the network

and distributional space was controlled (i.e., both were

constructed from text corpora), random walk and

foraging-based models both explained semantic

fluency data, although the foraging model

outperformed several different random walk models.

Of course, these findings are specific to the semantic

fluency task and adequately controlled comparisons of

network models to DSMs remain limited. However,

this work raises the question of whether the success of

association networks in explaining behavioral

performance in cognitive tasks is a consequence of

shared variance with the cognitive tasks themselves

(e.g., fluency tasks can be thought of as association

tasks constrained to a particular category) or truly

reflects the structural representation of semantic

memory, an issue that is discussed in detail in the

section summary. Nonetheless, recent work in this area

has focused on creating network representations using

a learning model instead of behavioral data

(Nematzadeh, Miscevic, & Stevenson, 2016), and also

advocated for alternative representations that

incorporate such learning mechanisms and provide a

computational account of how word associations might

be learned in the first place.

2.1.2 Feature-based Approaches

Feature-based models depart from the traditional

notion that a word has a localized representation (e.g.,

in an association network). The core idea behind

feature models is that words are represented in memory

as a collection of binary features (e.g., birds have

wings, whereas cars do not), and the correlation or

overlap of these features determines the extent to which

words have similar meanings. Smith, Shoben, and Rips

(1974) proposed a feature-comparison model, in which

concepts had two types of semantic features: defining

features that are shared by all concepts, and

characteristic features that are specific to only some

exemplars. For example, all birds <have wings>

(defining feature) but not all birds <fly> (characteristic

feature). Similarity between concepts in this model was

computed through a feature comparison process, and

the degree of overlap between the features of two

concepts directly predicted sentence verification times,

typicality effects, and differences in response times in

responding to “false” sentences. This notion of featural

overlap as an index of similarity was also central to the

theory of feature matching proposed by Tversky

(1977). Tversky viewed similarity as a set-theoretical

matching function, such that the similarity between a

and b could be conceptualized through a contrast model

as a function of features that are common to both a and

b (common features), and features that belong to a but

not b, as well as features that belong to b but not a

(distinctive features). Tversky’s contrast model

successfully accounted for asymmetry in similarity

judgments and judgments of difference for words,

shapes, and letters.

Although early feature-based models of semantic

memory set the groundwork for modern approaches to

semantic modeling, none of the models had any

systematic way of measuring these features (e.g., Smith

et al. applied multidimensional scaling to similarity

ratings to uncover underlying features). Later versions

of feature-based models thus focused on explicitly

coding these features into computational models by

using norms from property generation tasks (McRae et

al., 1997). To obtain these norms, participants were

asked to list features for concepts (e.g., for the word

ostrich, participants may list <is a> bird, <has wings>,

<is heavy> and <does not fly> as features), the idea

being that these features constitute explicit knowledge

participants have about a concept. McRae et al. then

used these features to train a model using simple

correlational learning algorithms (see next subsection)

applied over a number of iterations, which enabled the

network to settle into a stable state that represented a

learned concept. A critical result of this modeling

approach was that correlations among features

predicted response latencies in feature verification

tasks in human participants as well as model

simulations. Importantly, this approach highlighted

how statistical regularities among features may be

encoded in a memory representation over time.

Subsequent work in this line of research demonstrated

how feature correlations predicted differences in

priming for living and nonliving things and explained

typicality effects (McRae, 2004).

Despite the success of computational feature-based

models, an important limitation common to both

network and feature-based models was their inability

to explain how knowledge of individual features or

concepts was learned in the first place. For example,

while feature-based models can explain that ostrich

and emu are similar because both <have feathers>, how

did an individual learn that <having feathers> is a

feature that an ostrich or emu has? McRae et al.

claimed that features were derived from repeated

multimodal interactions with exemplars of a particular

concept, but how this learning process might work in

practice was missing from the implementation of these

models. Still, feature-based models have been very

useful in advancing our understanding of semantic

memory structure, and the integration of feature-based

information with modern machine-learning models

continues to remain an active area of research (see

Section III).

2.1.3 Distributional Approaches

Distributional Semantic Models (DSMs) refer to a

class of models that provide explicit mechanisms for

how words or features for a concept may be learned

from the natural environment. Therefore, unlike

associative network models or feature-based models,

DSMs do not use free association responses or feature

norms, but instead build representations by directly

extracting statistical regularities from a large natural

language corpus (e.g., books, newspapers, online

articles etc.), the assumption being that large text

corpora are a good proxy for the language that

individuals are exposed to in their lifetime. The

principle of extracting co-occurrence patterns and

inferring associations between concepts/words from a

large text-corpus is at the core of all DSMs, but exactly

how these patterns are extracted has important

implications for how these models conceptualize the

learning process. Specifically, two distinct

psychological mechanisms have been proposed to

account for associative learning, broadly referred to as

error-free and error-driven learning mechanisms.

Error-free learning mechanisms refer to a class of

psychological mechanisms that posit that learning

occurs by identifying clusters of events that tend to co-

occur in temporal proximity, and dates back to Hebb’s

(1949; also see McCulloch and Pitts, 1943) proposal of

how individual neurons in the brain adjust their firing

rates and activations in response to activations of other

neighboring neurons. This Hebbian learning

mechanism is at the heart of several classic and recent

models of semantic memory, which will be discussed

in this section. On the other hand, error-driven learning

mechanisms posit that learning is accomplished by

predicting events in response to a stimulus, and then

applying an error-correction mechanism to learn

associations. Error-correction mechanisms often vary

across learning models but broadly share principles

with Rescorla and Wagner’s (1972) model of animal

cognition, where they described how learning may

actually be driven by expectation error, instead of error-

free associative learning (Rescorla, 1988). This section

will review DSMs that are consistent with the error-

free and error-driven learning approaches to

constructing meaning representations, and the

summary section will discuss the evidence in favor of

and against each class of models.

2.2 Semantic Memory Learning

2.2.1 Error-free Learning-based DSMs

One of the earliest DSMs, the Hyperspace Analogue to

Language (HAL; Lund & Burgess, 1996) built

semantic representations by counting the co-

occurrences of words within a sliding window of 5 to

10 words, where co-occurrence between any two words

was inversely proportional to the distance between the

two words in that window. These local co-occurrences

produced a word-by-word co-occurrence matrix which

served as a spatial representation of meaning, such that

words that were semantically related were closer in a

high-dimensional space (see Figure 3; ear, eye, and

nose all acquire very similar representations).

Figure 3. The high-dimensional space produced by HAL

from co-occurrence word vectors, adapted from Lund &

Burgess (1996).

This relatively simple error-free learning mechanism

was able to account for a wide variety of cognitive

phenomena in tasks such as lexical decision and

categorization (Li, Burgess & Lund, 2000). However,

HAL encountered difficulties in accounting for

mediated priming effects (Livesay & Burgess, 1998;

see section summary for details), which was considered

as evidence in favor of semantic network models.

However, despite its limitations, HAL was an

important step in the ongoing development of DSMs.

Another popular distributional model that has been

widely applied across cognitive science is Latent

Semantic Analysis (LSA; Landauer & Dumais, 1997),

a semantic model that has successfully explained

performance in several cognitive tasks such as

semantic similarity (Landauer & Dumais, 1997),

discourse comprehension (Kintsch, 1998), and essay

scoring (Landauer, Laham, Rehder, & Schreiner,

1997). LSA begins with a word-document matrix of a

text corpus, where each row represents the frequency

of a word in each corresponding document, which is

clearly different from HAL’s word-by-word co-

occurrence matrix. Further, unlike HAL, LSA first

transforms these simple frequency counts into log

frequencies weighted by the word’s overall importance

over documents, to deemphasize the influence of

unimportant frequent words in the corpus. This

transformed matrix is then factorized using truncated

singular value decomposition, a factor-analytic

technique used to infer latent dimensions from a multi-

dimensional representation. The semantic

representation of a word can then be conceptualized as

an aggregate or distributed pattern across a few

hundred dimensions. The construction of a word-by-

document matrix and the dimensionality reduction step

are central to LSA and have the important consequence

of uncovering global or indirect relationships between

words even if they never co-occurred with each other

in the original context of documents. For example, lion

and stripes may have never co-occurred within a

sentence or document, but because they often occur in

similar contexts of the word tiger, they would develop

similar semantic representations. Importantly, the

ability to infer latent dimensions and extend the context

window from sentences to documents differentiates

LSA from a model like HAL.

Despite its widespread application and success,

LSA has been criticized on several grounds over the

years, e.g., for ignoring word transitions (Perfetti,

1998), violating power laws of connectivity (Steyvers

& Tenenbaum, 2005) and the lack of a mechanism for

learning incrementally (Jones, Willits, & Dennis,

2015). The last point is particularly important, as the

LSA model assumes that meaning is learned and

computed after a large amount of co-occurrence

information is available (i.e., in the form of a word by

document matrix). This is clearly unconvincing from a

psychological standpoint and is often cited as a reason

for distributional models being implausible

psychological models (Hoffman, McClelland, &

Lambon Ralph, 2018; Sloutsky, Yim, Yao, & Dennis,

2017). However, as Günther, Rinaldi, and Marelli

(2019) have recently noted, this is an argument against

batch-learning models like LSA, and not distributional

models per se. In principle, LSA can learn

incrementally by updating the co-occurrence matrix as

each input is received and re-computing the latent

dimensions (for a demonstration, see Olney, 2011),

although this process would be computationally very

expensive. In addition, there are several modern DSMs

that are incremental learners and propose

psychologically plausible accounts of semantic

representation.

One such incremental approach involves develop-

ing random representations of words that slowly

accumulate information about meaning through

repeated exposure to words in a large text corpus. For

example, Bound Encoding of the Aggregate Language

Environment (BEAGLE; Jones & Mewhort, 2007) is a

random vector accumulation model that gradually

builds semantic representations as it processes text in

sentence-sized context windows. BEAGLE begins by

assigning a random, static environmental vector to a

word the first time it is encountered in the corpus. This

environmental vector does not change over different

exposures of the word and is hypothesized to represent

stable physical characteristics about the word. When

words co-occur in a sentence, their environmental

vectors are added to each other’s representations, and

thus, their memory representations become similar

over time. Further, even if two words never co-occur,

they develop similar representations if they co-occur

with the same words. This leads to the formation of

higher-order relationships between words, without

performing any LSA-type dimensionality reduction.

Importantly, BEAGLE integrates this context-based

information with word-order information using a

technique called circular convolution (an effective

method to combine two n-dimensional vectors into an

associated vector of the same dimensions). BEAGLE

computes order information by binding together all

word chunks (formally called n-grams) that a particular

word is part of (e.g., for the sentence “an ostrich

flapped its wings”, the 2-gram convolution would bind

the representations for <an, ostrich> and <ostrich,

flapped> together) and then summing this order vector

with the word’s context vector to compute the final

semantic representation of the word. Thus, words that

co-occur in similar contexts as well as in the same

syntactic positions develop similar representations as

the model acquires more experience through the

corpus. BEAGLE outperforms several classic models

of word representation (e.g., LSA and HAL), and

explains performance on several complex tasks, such

as mediated priming effects in lexical decision and

pronunciation tasks, typicality effects in exemplar

categorization, and reading times in stem completion

tasks (Jones & Mewhort, 2007). Importantly, through

the addition of environmental vectors of words

whenever they co-occur, BEAGLE also indirectly

infers relationships between words that did not directly

co-occur. This process is similar in principle to

inferring indirect co-occurrences across documents in

LSA and can be thought of as an abstraction-based

process applied to direct co-occurrence patterns, albeit

through different mechanisms. Other incremental

models use ideas similar to BEAGLE for accumulating

semantic information over time, although they differ in

their theoretical underpinnings (Howard, Shakar, &

Jagadisan, 2011; Sahlgren, Holst, & Kanerva, 2008)

and the extent to which they integrate order

information in the final representations (Kanerva,

2009). It is important to note here that the DSMs

discussed so far (HAL, LSA, and BEAGLE) all share

the principle of deriving meaning representations

through error-free learning mechanisms, in the spirit of

Hebbian associative learning. The following section

discusses other DSMs that also produce rich semantic

representations but are instead based on error-driven

learning mechanisms or prediction.

2.2.2 Error-Driven Learning-based DSMs.

In contrast to error-free learning DSMs, a different ap-

proach to building semantic representations has fo-

cused on how representations may slowly develop

through prediction and error-correction mechanisms.

These models are also referred to as connectionist mod-

els and propose that meaning emerges through predic-

tion-based weighted interactions between intercon-

nected units (Rumelhart, Hinton, & McClelland, 1986).

Most connectionist models typically consist of an input

layer, an output layer and one or more intervening units

collectively called the hidden layers, each of which

contains one or more “nodes” or units. Activating the

nodes of the input layer (through an external stimulus)

leads to activation or suppression of units connected to

the input units, as a function of the weighted connec-

tion strengths between the units. Activation gradually

reaches the output units, and the relationship between

output units and input units is of primary interest.

Learning in connectionist models (sometimes called

feed-forward networks), can be accomplished in a su-

pervised or unsupervised manner. In supervised learn-

ing, the network tries to maximize the likelihood of a

desired goal or output for a given set of input units by

predicting outputs at every iteration. The weights of the

signals are thus adjusted to minimize the error between

the target output and the network’s output, through er-

ror backpropagation (Rumelhart, Hinton, & Williams,

1988). In unsupervised learning, weights within the

network are adjusted based on the inherent structure of

the data, which is used to inform the model about pre-

diction errors (e.g., Mikolov et al., 2013a; 2013b).

Rumelhart and Todd (1993) proposed one of the

first feed-forward connectionist models of semantic

memory. To train the network, all facts represented in a

traditional semantic network (e.g., Collins & Quillian,

1969) were first converted to input-output training

pairs (e.g., the fact, bird <has wings> was converted to

term 1: bird – relation: has – term 2: wings). Then, the

network learned semantic representations in a super-

vised manner, by turning on the input and relation

units, and backpropagating the error from predicted

output units through two hidden layers. For example,

the words oak and pine acquired a similar pattern of

activation across the hidden units because their node-

relations pairs were similar during training. Addition-

ally, the network was able to hierarchically learn infor-

mation about new concepts (e.g., adding the sparrow

<is a> bird link in the model formed a new representa-

tion for sparrow that also included relations like <has

wings>, <can fly> etc.). Connectionist networks are

sometimes also called neural networks (NNs), to em-

phasize that connectionist models (old and new) are in-

spired by neurobiology and attempt to model how the

brain might process incoming input and perform a par-

ticular task, although this is a very loose analogy and

modern researchers do not view neural networks as ac-

curate models of the brain (Bengio, Goodfellow, &

Courville, 2015).

A f eed-forward NN model, word2vec, proposed by

researchers at Google (Mikolov, Chen, Corrado, &

Dean, 2013a) has gained immense popularity in the last

few years due to its impressive performance on a vari-

ety of semantic tasks. Word2vec is a two-layer NN

model that has two versions: continuous bag-of-words

(CBOW) and skip-gram. The objective of the CBOW

model is to predict a target word, given four context

words before and after the intended word, using a clas-

sifier. The skip-gram model reverses this objective and

attempts to predict the surrounding context words,

given an input word (see Figure 4). In this way,

word2vec trains the network on a surrogate task and it-

eratively improves the word representations or “em-

beddings” (represented via the hidden layer units)

formed during this process by computing stochastic

gradient descent, a common technique to compute pre-

diction error for backpropagation in NN models. Fur-

ther, word2vec tweaks several hyperparameters to

achieve optimal performance. For example, it uses dy-

namic context windows so that words that are more dis-

tant from the target word are sampled less frequently in

the prediction task. Additionally, word2vec deempha-

sizes the role of frequent words by discarding frequent

words above a threshold with some probability. Finally,

to refine representations, word2vec uses negative sam-

pling, by which the model randomly samples a set of

unrelated words and learns to suppress these words

during prediction. These sophisticated techniques al-

low word2vec to develop very rich semantic represen-

tations. For example, word2vec is able to solve verbal

analogy problems, e.g., man: king :: woman: ??,

through simple vector arithmetic (but see Chen, Peter-

son, & Griffiths, 2017), and also model human similar-

ity judgments. This indicates that the representations

acquired by word2vec are sensitive to complex higher-

order semantic relationships, a characteristic that had

not been previously observed or demonstrated in other

NN models. Further, word2vec is a very weakly super-

vised (or unsupervised) learning algorithm, as it does

not require labeled or annotated data but only sequen-

tial text (i.e., sentences) to generate the word embed-

dings. word2vec’s pretrained embeddings have proven

to be useful inputs for several downstream natural lan-

guage processing tasks (Collobert & Weston, 2008) and

have inspired several other embedding models. For ex-

ample, fastText (Bojanowski, Grave, Joulin, &

Mikolov, 2017) is a word2vec-type NN that incorpo-

rates character-level information (i.e., n-grams) in the

learning process, which leads to more fine-grained rep-

resentations for rare words and words that are not in the

training corpus. However, the psychological validity of

some of the hyperparameters used by word2vec has

been called into question by some researchers. For

Figure 4. A d epi cti on of th e s kip-gram version of the word2vec model architecture. The model is creating a vector

representation for the word lived by predicting its surrounding words in the sentence “Jane’s mother lived in Paris”. The

weights of the hidden layer represent the vector representation for the word lived, as the model performs the prediction task

and adjusts the weights based on the prediction error. Adapted from Günther et al. (2019).

example, Johns, Mewhort, and Jones (2019) recently

investigated how negative sampling, which appears to

be psychologically unintuitive, affects semantic repre-

sentations. They argued that negative sampling simply

establishes a more accurate base rate of word occur-

rence and proposed solutions to integrate base-rate in-

formation into BEAGLE without the need to randomly

sample unrelated words or even a prediction mecha-

nism. However, as discussed in subsequent sections,

prediction appears to be a central mechanism in certain

tasks that involve sequential dependencies, and it is

possible that NN models based on prediction are indeed

capturing these long-term dependencies.

Another modern distributional model, Global Vec-

tors (GloVe), was recently introduced by Pennington,

Socher, and Manning (2014) shares similarities with

both error-free and NN-based error-driven models of

word representation. Similar to several DSMs, GloVe

begins with a word-by-word co-occurrence matrix.

But, instead of using raw counts as a starting point,

GloVe estimates the ratio of co-occurrence probabili-

ties between words. To give an example used by the

authors, based on statistics from text corpora, ice co-

occurs more frequently with solid than it does with gas,

whereas steam co-occurs more frequently with gas than

it does with solid. Further, both words (ice and steam)

co-occur with their shared property water frequently,

and both co-occur with the unrelated word fashion in-

frequently. The critical insight that GloVE capitalizes

on is that words like water and fashion are non-discrim-

inative, but the words gas and solid are important in

differentiating between ice and steam. The ratio of

probabilities highlights these differences, such that

large values (much greater than 1) correspond to prop-

erties specific to ice, and small values (much less than

1) correspond to properties specific of steam (see Fig-

ure 5). In this way, co-occurrence ratios successfully

capture abstract concepts such as thermodynamic

phases. GloVe aims to predict the logarithm of these

co-occurrence ratios between words using a regression

model, in the same spirit as factorizing the logarithm of

the co-occurrence matrix in LSA. Therefore, while in-

corporating global information in the learning process

(similar to LSA), GloVe also uses error-driven mecha-

nisms to minimize the cost function from the regression

model (using a modified version of stochastic gradient

descent, similar to word2vec), and therefore represents

a type of hybrid model. Further, to deemphasize the

overt influence of frequent and rare words, GloVe pe-

nalizes words with very high and low frequency (simi-

lar to importance weighting in LSA). The final ab-

stracted representations or “embeddings” that emerge

from the GloVe model are particularly sensitive to

higher-order semantic relationships, and the GloVe

model has been shown to perform remarkably well at

analogy tasks, word similarity judgments, and named

entity recognition (Pennington et al., 2014), although

there is little consensus in the field regarding the rela-

tive performance of GloVe against strictly prediction-

based models (e.g., word2vec; see Baroni et al,, 2014;

Levy and Goldberg, 2014).

2.3 Summary

This section provided a detailed overview of traditional

and recent computational models of semantic memory

and highlighted the core ideas that have inspired the

field in the past few decades with respect to semantic

memory representation and learning. While several

models draw inspiration from psychological principles,

the differences between them certainly have

implications for the extent to which they explain

behavior. This summary focuses on the extent to which

associative network and feature-based models, as well

as error-free and error-driven learning-based DSMs

speak to important debates regarding association,

direct and indirect patterns of co-occurrence, and

prediction.

2.3.1 Semantic vs. Associative Relationships.

Within the network-based conceptualization of

semantic memory, concepts that are related to each

other are directly connected (e.g., ostrich and emu have

a direct link). An important insight that follows from

this line of reasoning is that if ostrich and emu are

indeed related, then processing one of the words should

facilitate processing for the other word. This was

indeed the observation made by Meyer and

Schvaneveldt (1971), who reported the first semantic

priming study, where they found that individuals were

faster to make lexical decisions (deciding whether a

presented stimulus was a word or non-word) for

semantically related (e.g., ostrich-emu) word pairs,

compared to unrelated word pairs (e.g., apple-emu).

Given that individuals were not required to access the

semantic relationship between words to make the

lexical decision, these findings suggested that the task

potentially reflected automatic retrieval processes

operating on underlying semantic representations (also

see Neely, 1997). The semantic priming paradigm has

since become the most widely applied task in cognitive

psychology to examine semantic representation and

processes (for reviews, see Hutchison, 2003; Lucas,

2000; Neely, 2012).

An important debate that arose within the semantic

priming literature was regarding the nature of the

relationship that produces the semantic priming effect

as well as the basis for connecting edges in a semantic

network. Specifically, does processing the word ostrich

facilitate the processing of the word emu due to the

associative strength of connections between ostrich

and emu, or because the semantic features that form the

concepts of ostrich and emu largely overlap? As

discussed earlier, associative relations are thought to

reflect contiguous associations that individuals likely

Figure 5. Ratio of co-occurrence probabilities for ice and

steam, as described in Pennington et al. (2014).

infer from natural language (e.g., ostrich-egg).

Traditionally, such associative relationships have been

operationalized through responses in a free association

task (e.g., De Deyne, Navarro, Perfors, Brysbaert, &

Storms, 2019; Nelson et al., 2004). On the other hand,

semantic relations have traditionally included only

category coordinates or concepts with similar features

(e.g., ostrich-emu; Hutchison, 2003; Lucas, 2000).

Given these different operationalizations, some

researchers have attempted to isolate pure “semantic”

priming effects by selecting items that are semantically

related (i.e., share category membership; Fischler,

1977; Lupker, 1984; Thompson-Schill, Kurtz, &

Gabrieli, 1998) but not associatively related (i.e., based

on free association norms), although these attempts

have not been successful. Specifically, there appear to

be discrepancies in how associative strength is defined

and the locus of these priming effects. For example, in

a meta-analytic review, Lucas (2000) concluded that

semantic priming effects can indeed be found in the

absence of associations, arguing for the existence of

“pure” semantic effects. In contrast, Hutchison (2003)

revisited the same studies and argued that both

associative and semantic relatedness can produce

priming, and the effects largely depend on the type of

semantic relation being investigated as well as the task

demands (also see Balota & Paul, 1996).

Another line of research in support of associative

influences underlying semantic priming comes from

studies on mediated priming. In a typical experiment,

the prime (e.g., lion) is related to the target (e.g.,

stripes) only through a mediator (e.g., tiger) which is

not presented during the task. The critical finding is

that robust priming effects are observed in

pronunciation and lexical decision tasks for mediated

word pairs that do not share any obvious semantic

relationship or featural overlap (Balota & Lorch, 1986;

Livesay & Burgess, 1998; McNamara & Altarriba,

1988). Traditionally, mediated priming effects have

been explained through an associative-network based

account of semantic representation (e.g., Balota &

Lorch, 1986), where, consistent with a spreading

activation mechanism, activation from the prime node

(e.g., lion) spreads to the mediator node in the network

(e.g., tiger), which in turn activates the related target

node (e.g., stripes). Recent computational network

models have supported this conceptualization of

semantic memory as an associative network. For

example, Kenett, Levi, Anaki, and Faust (2017)

constructed a Hebrew network based on correlations of

responses in a free association task, and showed that

network path lengths in this Hebrew network

successfully predicted the time taken by participants to

decide whether two words were related or unrelated,

for directly related (e.g., bus-car) and relatively distant

word pairs (e.g., cheater-carpet). More recently,

Kumar, Balota, and Steyvers (2019) replicated Kenett

et al.’s work in a much larger corpus in English, and

also showed that undirected and directed networks

created by Steyvers and Tenenbaum (2005) also

account for such distant priming effects.

While network models provide a straightforward

account for mediated (and distant) priming, such

effects were traditionally considered a core challenge

for feature-based and distributional semantic models

(Hutchison, 2003; Masson, 1995; Plaut & Booth,

2000). The argument was that in feature-based

representations that conceptualize word meaning

through the presence or absence of features, lion and

stripes would not overlap because lions do not have

stripes. Similarly, in distributional models, at least

some early evidence from the HAL model suggested

that mediated word pairs neither co-occur nor have

similar high-dimensional vector representations

(Livesay & Burgess, 1998), which was taken as

evidence against a distributional representation of

semantic memory. However, other distributional

models such as LSA and BEAGLE have since been

able to account for mediated priming effects (e.g.,

Chwilla & Kolk, 2002; Hutchison, 2003; Jones,

Kintsch, & Mewhort, 2006; Jones & Mewhort, 2007;

Kumar, Balota, & Steyvers, 2019). In fact, Jones,

Kintsch, and Mewhort (2006) showed that HAL’s

greater focus on “semantic” relationships contributes to

its inability to account for mediated priming effects,

that are more “associative” in nature (also see Sahlgren,

2008). However, LSA and other DSMs that subscribe

to a broader conceptualization of meaning, that

includes both local “associative” as well as global

“semantic” relationships are indeed able to account for

mediated priming effects. The counterargument is that

mediated priming may simply reflect weak semantic

relationships between words (McKoon & Ratcliff,

1992), that can indeed be learned from statistical

regularities in natural language. Thus, even though lion

and stripes may have never co-occurred, newer

semantic models that capitalize on higher-order

indirect relationships are able to extract similar vectors

for these words and produce the same priming effects

without the need for a mediator or a spreading

activation mechanism (Jones, Kintsch, & Mewhort,

2006).

Therefore, an important takeaway from these

studies on clarifying the locus of semantic priming

effects is that the traditional distinction between

associative and semantic relations may need to be

revisited. Importantly, the operationalization of

associative relations through free association norms

has further complicated this distinction, as only

responses that are produced in free association tasks

have been traditionally considered to be associative in

nature. However, free association responses may

themselves reflect a wide variety of semantic relations

(McRae et al., 2012; see also Guida & Lenci, 2007) that

can produce different types of semantic priming

(Hutchison, 2003). Indeed, as McRae, Khalkhali, and

Hare (2012) noted, several of the associative level

relations examined in previous work (e.g., Lucas,

2000) could in fact be considered semantically related

in the broad sense (e.g., scene, feature, and script

relations). Within this view, it is unclear exactly how

associative relations operationalized in this way can be

truly separated from semantic relations, or conversely,

how semantic relations could truly be considered any

different from simple associative co-occurrence. In

fact, it is unlikely that words are purely associative or

purely semantically related. As McNamara (2005)

noted, “Having devoted a fair amount of time perusing

free association norms, I challenge anyone to find two

highly associated words that are not semantically

related in some plausible way” (McNamara, 2005; pp.

86). Furthermore, the traditional notion of what

constitutes a “semantic” relationship has changed and

is no longer limited to only coordinate or feature-based

overlap, as is evidenced by the DSMs discussed in this

section. Therefore, it appears that both associative

relationships as well as coordinate/feature relationships

now fall within the broader umbrella of what is

considered semantic memory.

There is one possible way to reconcile the historical

distinction between what are considered traditionally

associative and “semantic” relationships. Some

relationships may be simply dependent on direct and

local co-occurrence of words in natural language (e.g.,

ostrich and egg frequently co-occur in natural

language), whereas other relationships may in fact

emerge from indirect co-occurrence (e.g., ostrich and

emu do not co-occur with each other, but tend to co-

occur with similar words). Within this view,

traditionally “associative” relationships may reflect

more direct co-occurrence patterns, whereas

traditionally “semantic” relationships, or

coordinate/featural relations may reflect more indirect

co-occurrence patterns. As discussed in this section,

DSMs often distinguish between, and differentially

emphasize these two types of relationships (i.e., direct

vs. indirect co-occurrences, see Jones, Kintsch, &

Mewhort, 2006), which has important implications for

the extent to which these models speak to this debate

between associative vs. truly semantic relationships.

The combined evidence from the semantic priming

literature and computational modeling literature

suggests that the formation of direct associations is

most likely an initial step in the computation of

meaning. However, it also appears that the complex

semantic memory system does not simply reply on

these direct associations but also applies additional

learning mechanisms (vector accumulation,

abstraction, etc.) to derive other meaningful, indirect

semantic relationships. Implementing such global

processes allows modern distributional models to

develop more fine-grained semantic representations

that capture different types of relationships (direct and

indirect). However, there do appear to be important

differences in the underlying mechanisms of meaning

construction posited by different DSMs. Further, there

is also some concern in the field regarding the reliance

on pure linguistic corpora to construct meaning

representations (De Deyne, Perfors, & Navarro, 2016),

an issue that is closely related to the assessing the role

of associative networks and feature-based models in

understanding semantic memory, as discussed below.

Furthermore, it is also unlikely that any semantic

relationships are purely direct or indirect and may

instead fall on a continuum, which echoes the

arguments posed by Hutchison (2003) and Balota and

Paul (1996) regarding semantic vs. associative

relationships.

2.3.2 Va lu e o f A ss oc i at i ve N et wo r ks a nd Fea-

ture-based Models.

Another important part of this debate on associative

relationships are the representational issues posed by

association network models and feature-based models.

As discussed earlier, the validity of associative

semantic networks and feature-based models as

accurate models of semantic memory has been called

into question (Jones, Hills, & Todd, 2015), due to the

lack of explicit mechanisms for learning relationships

between words. One important observation from this

work is that the debate is less about the underlying

structure (network-based/localist or distributed) and

more about the input contributing to the resulting

structure. Networks and feature lists in and of

themselves are simply tools to represent a particular set

of data, similar to high-dimensional vector spaces. As

such, cosines in vector spaces can be converted to step-

based distances that form a network using cosine

thresholds (e.g., Gruenenfelder, Recchia, Rubin, &

Jones, 2016; Steyvers & Tenenbaum, 2005) or a binary

list of features (similar to “dimensions” in DSMs).

Therefore, the critical difference between associative

networks/feature-based models and DSMs is not that

the former is a network/list and the latter is a vector

space, but the fact that associative networks are

constructed from free association responses, feature-

based models use property norms, and DSMs learn

from text corpora. Therefore, as discussed earlier, the

success of associative networks (or feature-based

models) in explaining behavioral performance in

cognitive tasks could be a consequence of shared

variance with the cognitive tasks themselves. However,

associative networks also explain performance in tasks

that are arguably not based solely on retrieving

associations or features, for example, progressive

demasking (Kumar, Balota, & Steyvers, 2019),

similarity judgments (Richie, Zou, & Bhatia, 2019),

and the remote triads task where participants are asked

to choose the most related pair among a set of three

nouns (De Deyne, Perfors, & Navarro, 2016). This

points to the possibility that the part of the variance

explained by associative networks or feature-based

models may in fact be meaningful variance that

distributional models are unable to capture, instead of

entirely being shared task-based variance. To the extent

that DSMs are limited by the corpora they are trained

on (Recchia & Jones, 2009), it is possible that the

responses from free association tasks and property

generation norms capture some non-linguistic aspects

of meaning that are missing from standard DSMs, e.g.,

imagery, emotion, perception, etc.

Therefore, even though it is unlikely that

associative networks and feature-based models are a

complete account of semantic memory, the free

association and property generation norms that they are

constructed from are likely useful baselines to compare

DSMs against, because they include different types of

relationships that go beyond those observable in textual

corpora (De Deyne, Perfors, & Navarro, 2016). To that

end, Gruenenfelder, Recchia, Rubin, and Jones (2016)

compared three distributional models (LSA, BEAGLE

and Topic models) and one simple associative model

and indicated that only a hybrid model that combined

contextual similarity and associative networks

successfully predicted the graph theoretic properties of

free association norms (also see Richie, White, Hout,

& Bhatia, 2019). Therefore, associative networks and

feature-based models can potentially capture

complementary information compared to standard

distributional models and may provide additional cues

about the features and associations other than co-

occurrence that may constitute meaning. For instance,

there is evidence to show that perceptual features such

as size, color, texture that are readily apparent to

humans and may be used to infer semantic

relationships, are not effectively captured by co-

occurrence statistics derived from natural language

corpora (e.g., Baroni & Lenci, 2008, see Section III),

suggesting that semantic memory may in fact go

beyond simple co-occurrence. Indeed, as discussed in

Section III, multimodal and feature-integrated DSMs

that use different linguistic and non-linguistic sources

of information to learn semantic representations are

currently a thriving area of research and are slowly

changing the conceptualization of what constitutes

semantic memory (e.g., Bruni, Tran, & Baroni, 2014;

Lazaridou et al., 2015).

2.3.3 Error-free vs. Error-driven Learning

Prediction is another contentious issue in semantic

modeling that has gained a considerable amount of

traction in recent years, and the traditional distinction

between error-free Hebbian learning and error-driven

Rescorla-Wagner-type learning has been carried over

to debates between different DSMs in the literature. In

particular, DSMs that are based on extracting

temporally contiguous associations via error-free

learning mechanisms to derive word meanings (e.g.,

HAL, LSA, BEAGLE, etc.) have been referred to as

“count-based” models in computational linguistics and

natural language processing, and have been contrasted

against DSMs that employ a prediction-based

mechanism to learn representations (e.g., word2vec,

fastText, etc.), often referred to as “predict” models. It

is important to note here that the count vs. predict

distinction is somewhat artificial and misleading,

because even prediction-based DSMs effectively use

co-occurrence counts of words from natural language

corpora to generate predictions. The important

difference between these models is therefore not that

one class of models counts co-occurrences whereas the

other predicts them, but in fact that one class of models

employs an error-free Hebbian learning process

whereas the other class of models employs a

prediction-based error-driven learning process to learn

direct and indirect associations between words.

Nonetheless, in an influential paper, Baroni, Dinu, and

Kruszewski (2014) compared 36 “count-based” or

error-free learning-based DSMs to 48 “predict” or

error-driven learning-based DSMs and concluded that

error-driven learning-based (predictive) models

significantly outperformed their Hebbian learning-

based counterparts in a large battery of semantic tasks.

Additionally, Mandera, Keuleers, and Brysbaert (2017)

compared the relative performance of error-free

learning-based DSMs (LSA and HAL-type) and error-

driven learning-based models (CBOW and skip-gram

versions of word2vec) on semantic priming tasks

(Hutchison et al., 2013) and concluded that predictive

models provided a better fit to the data. They also

argued that predictive models are psychologically more

plausible because they employ error-driven learning

mechanisms consistent with principles posited by

Rescorla and Wagner (1972) and are computationally

more compact.

However, the argument that predictive models

employ psychologically plausible learning

mechanisms is incomplete, because error-free learning-

based DSMs also employ equally plausible learning

mechanisms, consistent with Hebbian learning

principles. Further, there is also some evidence

challenging the resounding success of predictive

models. Asr, Willits, and Jones (2016) compared an

error-free learning-based model (similar to HAL), a

random vector accumulation model (similar to

BEAGLE), and word2vec in their ability to acquire

semantic categories when trained on child-directed

speech data. Their results indicated that when the

corpus was scaled down to stimulus available to

children, the HAL-like model outperformed word2vec.

Other work has also found little to no advantage of

predictive models over error-free learning-based

models (De Deyne, Perfors, & Navarro, 2016; Recchia

& Nulty, 2017). Additionally, Levy, Goldberg, and

Dagan (2015) showed that hyperparameters like

window sizes, subsampling, and negative sampling can

significantly affect performance, and it is not the case

that predictive models are always superior to error-free

learning-based models. Collectively, these results point

to two possibilities. First, it is possible that large

amounts of training data (e.g., a billion words) and

hyperparameter tuning (e.g., subsampling or negative

sampling) are the main factors contributing to

predictive models showing the reported gains in

performance compared to their Hebbian learning

counterparts. To address this possibility, Levy and

Goldberg (2014) compared the computational

algorithms underlying error-free learning-based

models and predictive models and showed that the

skip-gram word2vec model implicitly factorizes the

word-context matrix, similar to several error-free

learning-based models such as LSA. Therefore, it does

appear that predictive models and error-free learning-

based models may not be as different as initially

conceived, and both approaches may actually converge

on the same set of psychological principles. Second, it

is possible that predictive models are indeed capturing

a basic error-driven learning mechanism that humans

use to perform certain types of complex tasks that

require keeping track of sequential dependencies, such

as sentence processing, reading comprehension, and

event segmentation. Subsequent sections will discuss

how state-of-the-art approaches specifically aimed at

explaining performance in such complex semantic

tasks are indeed variants or extensions of this

prediction-based approach, suggesting that these

models currently represent a promising and

psychologically intuitive approach to semantic

representation.

Language is clearly an extremely complex behavior

and even though modern DSMs like word2vec and

GloVe that are trained on vast amounts of data

successfully explain performance across a variety of

tasks, adequate accounts of how humans generate

sufficiently rich semantic representations with

arguably lesser “data” are still missing from the field.

Further, there appears to be relatively little work

examining how newly trained models on smaller

datasets (e.g., child-directed speech) compare to

children’s actual performance on semantic tasks. The

majority of the work in machine learning and natural

language processing has focused on building models

that outperform other models, or how the models

compare to task benchmarks for only young adult

populations. Therefore, it remains unclear how the

mechanisms proposed by these models compare to the

language acquisition and representation processes in

humans, although subsequent sections will make the

case that recent attempts towards incorporating

multimodal information, and temporal and attentional

influences is making significant strides in this

direction. Ultimately, it is possible that humans use

multiple levels of representation and more than one

mechanism to produce and maintain flexible semantic

representations that can be widely applied across a

wide range of tasks, and a brief review of how

empirical work on context, attention, perception, and

action has informed semantic models will provide a

finer understanding on some of these issues.

3 Contextual and Retrieval-based Se-

mantic Memory

Despite the traditional notion of semantic memory

being a “static” store of verbal knowledge about

concepts, accumulating evidence within the past few

decades suggests that semantic memory may actually

be context-dependent. Consider the meaning of the

word ostrich. Does the conceptualization of what the

word ostrich means change when an individual is

thinking about the size of different birds versus the

types of eggs one could use to make an omelet?

Although intuitively it appears that there is one “static”

representation of ostrich that remains unchanged

across different contexts, considerable evidence on the

time course of sentence processing suggests otherwise.

In particular, a large body of work has investigated how

semantic representations come online during sentence

comprehension and the extent to which these

representations depend on the surrounding context. For

example, there is evidence to show that the surrounding

sentential context and the frequency of meaning may

influence lexical access for ambiguous words (e.g.,

bark has a tree and sound-related meaning) at different

timepoints (Swinney, 1979; Tabossi, Colombo, & Job,

1987). Furthermore, extensive work by Rayner and

colleagues on eye movements in reading has shown

that the frequency of different meanings of a word, the

bias in the linguistic context, and preceding modifiers

can modulate the extent to which multiple meanings of

a word are automatically activated (Binder, 2003;

Binder & Rayner, 1998; Duffy et al., 1988; Duffy,

Morris, & Rayner, 1988; Pacht & Rayner, 1993;

Rayner, Cook, Juhasz, & Frazier, 2006; Rayner &

Frazier, 1989). Collectively, this work is consistent

with two-process theories of attention (Neely, 1977;

Posner & Snyder, 1975), according to which a fast,

automatic activation process, as well as a slow,

conscious attention mechanism are both at play during

language-related tasks. The two-process theory can

clearly account for findings like “automatic”

facilitation in lexical decisions for words related to the

dominant meaning of the ambiguous word in the

presence of biasing context (Tabossi et al., 1987), and

longer “conscious attentional” fixations on the

ambiguous word when the context emphasizes the non-

dominant meaning (Pacht & Rayner, 1993).

Another aspect of language processing is the ability

to consciously attend to different parts of incoming

linguistic input to form inferences on the fly. One line

of evidence that speaks to this behavior comes from

empirical work on reading and speech processing using

the N400 component of event-related brain potentials

(ERPs). The N400 component is thought to reflect

contextual semantic processing, and sentences ending

in unexpected words have been shown to elicit greater

N400 amplitude compared to expected words, given a

sentential context (e.g., Block & Baldwin, 2010;

Federmeier & Kutas, 1999; Kutas & Hillyard, 1980).

This body of work suggests that sentential context and

semantic memory structure interact during sentence

processing (see Federmeier & Kutas, 1999). Other

work has examined the influence of local attention,

context, and cognitive control during sentence

comprehension. In an eye-tracking paradigm, Nozari,

Trueswell, and Thompson-Schill (2016) had

participants listen to a sentence (e.g., “She will cage the

red lobster”) as they viewed four colorless drawings.

The drawings contained a local attractor (e.g., cherry)

which was compatible with the closest adjective (e.g.,

red) but not the overall context, or an adjective-

incompatible object (e.g., igloo). Context was

manipulated by providing a verb that was highly

constraining (e.g., cage) or non-constraining (e.g.,

describe). The results indicated that participants fixated

on the local attractor in both constraining and non-

constraining contexts, compared to incompatible

control words, although fixation was smaller in more

constrained contexts. Collectively, this work indicates

that linguistic context and attentional processes interact

and shape semantic memory representations, providing

further evidence for automatic and attentional

components (Neely, 1977; Posner & Snyder, 1975)

involved in language processing.

Given these findings and the automatic-attentional

framework, it is important to investigate how

computational models of semantic memory handle

ambiguity resolution (i.e., multiple meanings) and

attentional influences and depart from the traditional

notion of a context-free “static” semantic memory

store. Critically, DSMs that assume a static semantic

memory store (e.g., LSA, GloVe, etc.) cannot

straightforwardly account for the different contexts

under which multiple meanings of a word are activated

and suppressed, or how attending to specific linguistic

contexts can influence the degree to which other related

words are activated in the memory network. The

following sections will further elaborate on this issue

of ambiguity resolution and review some recent

literature on modeling contextually dependent

semantic representations.

3.1 Ambiguity Resolution in Error-free

Learning-based DSMs

Vir tua lly a ll DS Ms discu sse d so far con str uct a single

representation of a word’s meaning by aggregating

statistical regularities across documents or contexts.

This approach suffers from the drawback of collapsing

multiple senses of a word into an “average”

representation. For example, the homonym bark would

be represented as a weighted average of its two

meanings (the sound and the trunk), leading to a

representation that is more biased towards the more

dominant sense of the word. Homonyms (e.g., bark)

and polysemes (e.g., newspaper may refer to the

physical object or a national daily) represent over 40%

of all English words (Britton, 1978; Durkin &

Manning, 1989), and because DSMs do not

appropriately model the non-dominant sense of a word,

they tend to underperform in disambiguation tasks and

also cannot appropriately model the behavior observed

in sentence processing tasks (e.g., Swinney, 1979).

Indeed, Griffiths et al. (2007) have argued that the

inability to model representations for polysemes and

homonyms is a core challenge and may represent a key

falsification criterion for certain distributional models

(also see Jones, 2018). Early distributional models like

LSA and HAL recognized this limitation of collapsing

a word’s meaning into a single representation.

Landauer (2001) noted that LSA is indeed able to

disambiguate word meanings when given surrounding

context, i.e., neighboring words (for similar arguments,

see Burgess, 2001). To that end, Kintsch (2001)

proposed an algorithm operating on LSA vectors that

examined the local context around the target word to

compute different senses of the word. While the

approach of applying a process model over and above

the core distributional model could be criticized, it is

important to note that meaning is necessarily

distributed across several dimensions in DSMs and

therefore any process model operating on these vectors

is using only information already contained within the

vectors (see Günther et al., 2019 for a similar

argument).

An alternative proposal to model semantic memory

and also account for multiple meanings was put forth

by Blei, Ng and Jordan (2003) and Griffiths et al.

(2007) in the form of topic models of semantic

memory. In topic models, word meanings are

represented as a distribution over a set of meaningful

probabilistic topics, where the content of a topic is

determined by the words to which it assigns high

probabilities. For example, high probabilities for the

words desk, paper, board, and teacher might indicate

that the topic refers to a classroom, whereas high

probabilities for the words board, flight, bus and

baggage might indicate that the topic refers to travel.

Thus, in contrast to geometric DSMs where a word is

represented as a point in a high-dimensional space,

words (e.g., board) can have multiple representations

across the different topics (e.g., classroom, travel) in a

topic model. Importantly, topic models take the same

word-document matrix as input as LSA and uncover

latent “topics” in the same spirit of uncovering latent

dimensions through an abstraction-based mechanism

that goes over and above simply counting direct co-

occurrences, albeit through different mechanisms,

based on Markov Chain Monte Carlo methods

(Griffiths & Steyvers, 2002; 2003; 2004). Topic models

successfully account for free association norms that

show violations of symmetry, triangle inequality, and

neighborhood structure (Tversky, 1977) that are

problematic for other DSMs (but see Jones,

Gruenenfelder, & Recchia, 2018) and also outperform

LSA in disambiguation, word prediction, and gist

extraction tasks (Griffiths et al., 2007). However, the

original architecture of topic models involved setting

priors and specifying the number of topics a priori,

which could lead to the possibility of experimenter bias

in modeling (Jones, Willits, & Dennis, 2011). Further,

the original topic model was essentially a “bag-of-

words” model and did not capitalize on the sequential

dependencies in natural language, like other DSMs

(e.g., BEAGLE). Recent work by Andrews and

Vig lio cco (201 0) has extended the topic model to

incorporate word order information, yielding more

fine-grained linguistic representations that are sensitive

to higher-order semantic relationships. Additionally,

given that topic models represent word meanings as a

distribution over a set of topics, they naturally account

for multiple senses of a word without the need for an

explicit process model, unlike other DSMs such as LSA

or HAL (Griffiths et al., 2007).

Therefore, it appears that when DSMs are provided

with appropriate context vectors through their

representation (e.g., topic models) or additional

assumptions (e.g., LSA), they are indeed able to

account for patterns of polysemy and homonymy.

Additionally, there has been a recent movement in

natural language processing to build distributional

models that can naturally tackle homonymy and

polysemy. For example, Reisinger and Mooney (2010)

used a clustering approach to construct sense-specific

word embeddings, that were successfully able to

account for word similarity in isolation and within a

sentential context. In their model, a word’s contexts

were clustered to produce different groups of similar

context vectors, and these context vectors were then

averaged into sense-specific vectors for the different

clusters. A slightly different clustering approach was

taken by Li and Jurafsky (2015), where the sense

clusters and embeddings were jointly learned using a

Bayesian non-parametric framework. Their model used

the Chinese Restaurant Process, according to which a

new sense vector for a word was computed when

evidence from the context (e.g., neighboring and co-

occurring words) suggested that it was sufficiently

different from the existing senses. Li and Jurafsky

indicated that their model successfully outperformed

traditional embeddings on semantic relatedness tasks.

Other work in this area has employed multilingual

distributional information to generate different senses

for words (Upadhyay, Chang, Taddy, Kalai, & Zou,

2017), although the use of multiple languages to

uncover word senses does not appear to be a

psychologically plausible proposal for how humans

derive word senses from language. Importantly, several

of these recent approaches rely on error-free learning-

based mechanisms to construct semantic

representations that are sensitive to context. The

following section describes some recent work in

machine learning that has focused on error-driven

learning mechanisms that can also adequately account

for contextually-dependent semantic representations.

3.2 Ambiguity Resolution in Predictive

DSMs

One particular drawback of multi-sense embeddings

discussed above is that the meaning of a word can vary

across multiple sentential contexts and enumerating all

the different senses for a particular word can be both

subjective (Westbury, 2016) and computationally

expensive. For example, the word star can refer to its

astronomical meaning, a film star, a rockstar, as well as

an asterisk among other things, and the surrounding

linguistic context itself may be more informative in

understanding the meaning of the word star, instead of

trying to enumerate all the different senses of star,

which was the goal of multi-sense embeddings. The

idea of using sentential context itself to derive a word’s

meaning was first proposed in Elman’s (1990) seminal

work on the Simple Recurrent Network (SRN), where

a set of context units that contained the previous hidden

state of the neural network model served as “memory”

for the next cycle. In this way, the internal

representations that the SRN learned were sensitive to

previously encountered linguistic context. This simple

recurrent architecture successfully predicted word

sequences, grammatical classes, and constituent

structure in language (Elman, 1990; 1991). Modern

Recurrent Neural Networks (RNNs) build upon the

intuitions of the SRN and come in two architectures:

Long Short-Term M emory ( LSTM ) a nd G ated

Recurrent Unit (GRU). LSTMs introduced the idea of

memory cells, i.e., a vector that could preserve error

signals over time and overcome the problem of

vanishing error signals over long sequences

(Hochreiter & Schmidhuber, 1997). Access to the

memory cells is controlled through gates in LSTMs,

where gate values are linear combinations of the

current input and the previous model state. GRUs also

have a gated architecture but differ in the number of

gates and how they combine the hidden states (Olah,

Figure 6. A d epi cti on of th e ELM o a rchit ect ure. The h idd en la yer s o f two lo ng sh ort -term memory networks (LSTMs;

forward and backward) are first concatenated, followed by a weighted sum of the hidden layers with the embedding layer,

resulting in the final 3-layer representation for a particular word. Adapted from Alammar (2019).

2014). LSTMs and GRUs are currently the most

successful types of RNNs and have been extensively

applied to construct contextually sensitive,

compositional (discussed in Section IV) models of

semantic memory.

The RNN approach inspired Peters et al. (2018) to

construct Embeddings from Language Models

(ELMo), a modern version of recurrent neural

networks (RNNs). Peters et al.’s ELMo model uses a

bidirectional LSTM combined with a traditional NN

language model to construct contextual word

embeddings. Specifically, instead of explicitly training

to predict predefined or empirically determined sense

clusters, ELMo first tries to predict words in a sentence

going sequentially forward and then backward,

utilizing recurrent connections through a 2-layer

LSTM. The embeddings returned from these

“pretrained” forward and backward LSTMs are then

combined with a task-specific NN model to construct a

task-specific representation (see Figure 6). One key

innovation in the ELMo model is that instead of only

using the topmost layer produced by the LSTM, it

computes a weighed linear combination of all three

layers of the LSTM to construct the final semantic

representation. The logic behind using all layers of the

LSTM in ELMo is that this process yields very rich

word representations, where higher-level LSTM states

capture contextual aspects of word meaning and lower-

level states capture syntax and parts of speech. Peters

et al. showed that ELMo’s unique architecture is

successfully able to outperform other models in

complex tasks like question answering, coreference

resolution, and sentiment analysis among others. The

success of recent recurrent models such as ELMo in

tackling multiple senses of words represents a

significant leap forward in modeling contextualized

semantic representations.

Modern RNNs such as ELMo have been successful

at predicting complex behavior because of their ability

to incorporate previous states into semantic

representations. However, one limitation of RNNs is

that they encode the entire input sequence at once,

which slows down processing and becomes

problematic for extremely long sequences. For

example, consider the task of text summarization,

where the input is a body of text, and the task of the

model is to paraphrase the original text. Intuitively, the

model should be able to “attend” to specific parts of the

text and create smaller “summaries” that effectively

paraphrase the entire passage. This intuition inspired

the attention mechanism, where “attention” could be

focused on a subset of the original input units by

weighting the input words based on positional and

semantic information. The model would then predict

target words based on relevant parts of the input

sequence. Bahdanau, Cho, and Bengio (2014) first

applied the attention mechanism to machine translation

using two separate RNNs to first encode the input

sequence and then used an attention head to explicitly

focus on relevant words to generate the translated

outputs. “Attention” was focused on specific words by

computing an alignment score, to determine which

input states were most relevant for the current time step

and combining these weighted input states into a

context vector. This context vector was then combined

with the previous state of the model to generate the

predicted output. Bahdanau et al. showed that the

attention mechanism was able to outperform previous

models in machine translation (e.g., Cho et al., 2014)

especially for longer sentences.

Attention NNs are now at the heart of several state-

of-the-art language models, like Google’s Transformer

(Vaswani et al., 2017), BERT (Devlin, Chan, Lee, &

Tou ta nov a, 2 019 ), O pen AI ’s GPT-2 (Radford et al.,

2019) and GPT-3 (Brown et al., 2020), and Facebook’s

RoBERTa (Liu et al., 2019). Two key innovations in

these new attention-based NNs have led to remarkable

performance improvements in language processing

tasks. First, these models are being trained on a much

larger scale than ever before, allowing them to learn

from a billion iterations and over several days (e.g.,

Radford et al., 2019). Second, modern attention-NNs

entirely eliminate the sequential recurrent connections

that were central to RNNs. Instead, these models use

multiple layers of attention and positional information

to process words in parallel. In this way, they are able

to focus attention on multiple words at a time to

perform the task at hand. For example, Google’s BERT

model assigns position vectors to each word in a

sentence. These position vectors are then updated using

attention vectors, which represent a weighted sum of

position vectors of other words and depend upon how

strongly each position contributes to the word’s

representation. Specifically, attention vectors are

computed using a compatibility function (similar to an

alignment score in Bahdanau et al., 2014), which

assigns a score to each pair of words indicating how

strongly they should attend to one another. These

computations iterate over several layers and iterations

with the dual goal of predicting masked words in a

sentence (e.g., I went to the [mask] to buy a [mask] of

milk; predict store and carton) as well as deciding

whether one sentence (e.g., They were out of reduced

fat [mask], so I bought [mask] milk) is a valid

continuation of another sentence (e.g., I went to the

store to buy a carton of milk). By computing errors

bidirectionally and updating the position and attention

vectors with each iteration, BERT’s word vectors are

influenced by other words’ vectors and tend to develop

contextually dependent word embeddings. For

example, the representation of the word ostrich in the

BERT model would be different when it is in a sentence

about birds (e.g., ostriches and emus are large birds) vs.

food (ostrich eggs can be used to make omelets), due

to the different position and attention vectors

contributing to these two representations. Importantly,

the architecture of BERT allows it to be flexibly

finetuned and applied to any semantic task, while still

using the basic attention-based mechanism. This

framework turns out to be remarkably efficient and

models based on the general Transformer architecture

(e.g., BERT, RoBERTa, GPT-2, & GPT-3) outperform

LSTM-based recurrent approaches in semantic tasks

such as sentiment analysis (Socher et al., 2013),

sentence acceptability judgments (Warstadt et al.,

2018), and even tasks that are dependent on semantic

and world knowledge, such as the Winograd Schema

Challenge (Levesque, Davis, & Morgenstern, 2011) or

novel language generation (Brown et al., 2020).

However, considerable work is beginning to evaluate

these models using more rigorous test cases and

starting to question whether these models are actually

learning anything meaningful (e.g., Brown et al., 2020;

Niven & Kao, 2019), an issue that is discussed in detail

in Section V.

Although the technical complexity of attention-

based NNs makes it difficult to understand the

underlying mechanisms contributing to their

impressive success, some recent work has attempted to

demystify these models (e.g., Clark, Khandelwal,

Levy, & Manning, 2019; Coenen et al., 2019; Michel,

Levy, & Neubig, 2019; Tenney, Das, & Pavlick, 2019).

For example, Clark et al., (2019) recently showed that

BERT’s attention heads actually attend to meaningful

semantic and syntactic information in sentences, such

as determiners, objects of verbs, and coreferent

mentions (see Figure 7), suggesting that these models

may indeed be capturing meaningful linguistic

knowledge, which may be driving their performance.

Further, some recent evidence also shows that BERT

successfully captures phrase-level representations,

indicating that BERT may indeed have the ability to

model compositional structures (Jawahar, Sagot, &

Seddah, 2019), although this work is currently in its

nascent stages. Furthermore, it remains unclear how

this conceptualization of attention fits with the

automatic-attentional framework (Neely, 1977).

Demystifying the inner workings of attention NNs and

focusing on process-based accounts of how

computational models may explain cognitive

phenomena clearly represents the next step towards

integrating these recent computational advances with

empirical work in cognitive psychology.

Figure 7. BERT attention heads that correspond to linguistic

phenomena like attending to noun phrases and verbs.

Arrows indicate specific relationships that the heads are

attending to within each sentence. Adapted from Clark et al.

(2019)

Collectively, these recent approaches to construct

contextually sensitive semantic representations

(through recurrent and attention-based NNs) are

showing unprecedented success at addressing the

bottlenecks regarding polysemy, attentional influences,

and context that were considered problematic for

earlier DSMs. An important insight that is common to

both contextualized RNNs and attention-based NNs

discussed above is the idea of contextualized semantic

representations, a notion that is certainly at odds with

the traditional conceptualization of context-free

semantic memory. Indeed, the following section

discusses a new class of models take this notion a step

further by entirely eliminating the need for learning

representations or “semantic memory” and propose

that all meaning representations may in fact be

retrieval-based, therefore blurring the historical

distinction between episodic and semantic memory.

3.3 Retrieval-based Models of Semantic

Memory

Tulving’s (1972) episodic-semantic dichotomy

inspired foundational research on semantic memory

and laid the groundwork for conceptualizing semantic

memory as a static memory store of facts and verbal

knowledge that was distinct from episodic memory,

which was linked to events situated in specific times

and places. However, some recent attempts at modeling

semantic memory have taken a different perspective on

how meaning representations are constructed.

Retrieval-based models challenge the strict distinction

between semantic and episodic memory, by

constructing semantic representations through

retrieval-based processes operating on episodic

experiences. Retrieval-based models are based on

Hintzman’s (1988) MINERVA 2 model, which was

originally proposed to explain how individuals learn to

categorize concepts. Hintzman argued that humans

store all instances or episodes that they experience, and

that categorization of a new concept is simply a

weighted function of its similarity to these stored

instances at the time of retrieval. In other words, each

episodic experience lays down a trace, which implies

that if an item is presented multiple times, it has

multiple traces. At the time of retrieval, traces are

activated in proportion to its similarity with the

retrieval cue or probe. For example, an individual may

have seen an ostrich in pictures or at the zoo multiple

times and would store each of these instances in

memory. The next time an ostrich-like bird is

encountered by this individual, they would match the

features of this bird to a weighted sum of all stored

instances of ostrich and compute the similarity

between these features to decide whether the new bird

is indeed an ostrich. Hintzman’s work was crucial in

developing the exemplar theory of categorization,

which is often contrasted against the prototype theory

of categorization (Rosch, 1975), which suggests that

individuals “learn” or generate an abstract prototypical

representation of a concept (e.g., ostrich) and compare

new examples to this prototype to organize concepts

into categories. Importantly, Hintzman’s model

rejected the need for a strong distinction between

episodic and semantic memory (Tulving, 1972) and has

inspired a class of models of semantic memory often

referred to as retrieval-based models.

Kwantes (2005) proposed a retrieval-based

alternative to LSA-type distributional models by

computing semantic representations “on the fly” from

a term-document matrix of episodic experiences.

Based on principles from Hintzman’s (1988)

MINERVA 2 model, in Kwantes’ model, each word has

a context vector (i.e., memory trace) associated with it,

which contains its frequency of occurrence within each

document of the training corpus. When a word is

encountered in the environment, it is used as a cue to

retrieve the context vector, which activates the traces

of all words in lexical memory. The activation of a trace

is directly proportional to the contextual similarity

between their context vectors. Memory traces are then

weighted by their activations and summed across the

context vectors to construct the final semantic

representation of the target word. The resulting

semantic representations from Kwantes’ model

successfully captured higher-order semantic

relationships, similar to LSA, without the need for

storing, abstracting, or learning these representations at

the time of encoding.

Modern retrieval-based models have been

successful at explaining complex linguistic and

behavioral phenomena, such as grammatical

constraints (Johns & Jones, 2015) and free association

(Howard et al., 2011), and certainly represent a

significant departure from the models discussed thus

far. For example, Howard et al. (2011) proposed a

model that constructed semantic representations using

temporal context. Instead of defining context in terms

of a sentence or document like most DSMs, the

Predictive Temporal Context Model (pTCM; see also

Howard & Kahana, 2002) proposes a continuous

representation of temporal context that gradually

changes over time. Items in the pTCM are activated to

the extent that their encoded context overlaps with the

context that is cued. Further, context is also used to

predict items that are likely to appear next, and the

semantic representation of an item is the collection of

prediction vectors in which it appears over time. These

previously learned prediction vectors also contribute to

the word’s future representations. Howard et al.

showed that the pTCM successfully simulates human

performance in word association tasks and is able to

capture long-range dependencies in language that are

problematic for other DSMs. In its core principles of

constructing representations from episodic contexts,

the pTCM is similar to other retrieval-based models,

but its ability to learn from previous states and

gradually accumulate information also shares

similarities with the SRN (Elman, 1990), BEAGLE

(Jones & Mewhort, 2007), and some of the recent error-

driven learning DSMs discussed in Section II (e.g.,

word2vec, ELMo, etc.).

More recently, Jamieson, Avery, Johns, and Jones

(2018) proposed an instance-based theory of semantic

memory, also based on MINERVA 2. In their model,

word contexts are stored as n-dimensional vectors

representing multiple instances in episodic memory.

Memory of a document (or conversation) is the sum of

all word vectors, and a “memory” vector stores all

documents in a single vector. A word’s meaning is

retrieved by cueing the memory vector with a probe,

which activates each trace in proportion to its similarity

to the probe. The aggregate of all activated traces is

called an echo, where the contribution of a trace is

directly weighted by its activation. The retrieved echo,

in response to a probe, is assumed to represent a word’s

meaning. Therefore, the model exhibits “context

sensitivity” by comparing the activations of the

retrieval probe with the activations of other traces in

memory, thus producing context-dependent semantic

representations without any mechanism for learning

these representations. For example, Jamieson et al.

showed that for the homograph break (with three

senses, related to stopping, smashing, and news

reporting), when their model is provided with a

disambiguating context using a joint probe (e.g.,

break/car), the retrieved representation (or “echo”) is

more similar to the word stop, compared to the words

report and smash, thus producing a context-dependent

semantic representation of the word break. Therefore,

Jamieson et al.’s model successfully accounts for some

findings pertaining to ambiguity resolution that have

been difficult to accommodate within traditional DSM-

based accounts and proposes that meaning is created

“on the fly” and in response to a retrieval cue, an idea

that is certainly inconsistent with traditional semantic

models.

3.4 Summary

Although it is well-understood that prior knowledge or

semantic memory influences how individuals perceive

events (e.g., Bransford & Johnson, 1979; Deese, 1959;

Roediger & McDermott, 1995), the notion that

semantic memory may itself be influenced by episodic

events is relatively recent. This section discussed how

the conceptualization of semantic memory of being an

independent and static memory store is slowly

changing, in light of evidence that context shapes the

structure of semantic memory. Retrieval-based models

represent an important departure from the traditional

notions about semantic memory, and instead propose

that the meaning of a word is computed “on the fly” at

retrieval, and do not subscribe to the idea of storing or

learning a static semantic representation of a concept.

This conceptualization is clearly at odds with

traditional accounts of semantic memory and hearkens

back to the distinction between prototype and exemplar

theories of categorization briefly eluded to earlier.

Specifically, in the computational models of semantic

memory discussed so far (with the exception of

retrieval-based models), the idea of inferring indirect

co-occurrences and/or latent dimensions, i.e., learning

through abstraction emerges as a core mechanism

contributing to the construction of meaning. This idea

of abstraction has also been central to computational

models that have been applied to understand category

structure. Specifically, prototype theories (Rosch et al.,

1976; Rosch & Lloyd, 1978; also see Posner & Keele,

1968) posit that as individual concepts are experienced,

humans gradually develop a prototypical

representation that contains the most useful and

representative information about that category. This

notion of constructing an abstracted, prototypical

representation is at the heart of several computational

models of semantic memory discussed in this review.

For example, both LSA and BEAG LE con struct an

“average” prototypical semantic representation from

individual linguistic experiences. Of course, LSA uses

a term-document matrix and singular value

decomposition whereas BEAGLE learns meaning by

incrementally combining co-occurrence and order

information to compute a composite representation, but

both models represent a word as a single point

(prototype) in a high-dimensional space. Retrieval-

based models, on the other hand, are inspired by

Hintzman’s work and the exemplar theory of

categorization and assume that semantic

representations are constructed in response to retrieval

cues and reject the idea of prototypical representations

or abstraction-like learning processes occurring at the

time of encoding. Given the success of retrieval-based

models at tackling ambiguity and several other

linguistic phenomena, these models clearly represent a

powerful proposal for how meaning is constructed.

However, before abstraction (at encoding) can be

rejected as a plausible mechanism underlying meaning

computation, retrieval-based models need to address

several bottlenecks, only one of which is computational

complexity. Jones (2018) recently noted that

computational constraints should not influence our

preference of traditional prototype models over

exemplar-based models, especially since exemplar

models have provided better fits to categorization task

data, compared to prototype models (Ashby &

Maddox, 1993; Nosofsky, 1988; Stanton, Nosofsky, &

Zaki, 2002). However, implementation is a core test for

theoretical models and retrieval-based models must be

able to explain how the brain manages this

computational overhead. Specifically, retrieval-based

models argue against any type of “semantic memory”

at all and instead propose that semantic representations

are created on the fly when words or concepts are

encountered within a particular context. As discussed

earlier, while there is evidence to suggest that the

representations likely change with every new

encounter (e.g., for a review, see Yee & Thompson-

Schill, 2018), it is still unclear why the brain would

create a fresh new representation for a particular

concept on the fly each time that concept is

encountered, and not “learn” something about the

concept from previous encounters that could aid future

processing. It seems more psychologically plausible

that the brain learns and maintains a semantic

representation (stored via changes in synaptic activity,

see Mayford, Siegelbaum, & Kandel, 2012) that is

subsequently finetuned or modified with each new

incoming encounter – a proposal that is closer to the

mechanisms underlying recurrent and attention-NNs

discussed earlier in this section. Furthermore, in light

of findings that top-down information or previous

knowledge does in fact guide cognitive behavior (e.g.,

Bransford & Johnson, 1979; Deese, 1959; Roediger &

McDermott, 1995) and bottom-up processes interact

with top-down processes (Neisser, 1976), the proposal

that there may not be any existing semantic structures

in place at all certainly requires more investigation. It

is important to note here that individual traces for

episodic events may indeed need to be stored by the

system for other cognitive tasks, but the argument here

is that retrieving the meaning of a concept need not

necessarily require the storage of every individual

experience or trace. For example, consider the simple

memory task of remembering a list of words: train,

ostrich, lemon, and truth. Encoding a representation of

this event likely involves laying down a trace of this

experience in memory. However, retrieval-based

models would posit that the representation of the word

ostrich in this context would in fact be a weighted sum

of every other time the word or concept of ostrich has

been experienced, all of which have been stored in

memory. This conceptualization seems unnecessary,

especially given that other DSMs that instead use more

compact learning-based representations have been

fairly successful at simulating performance in semantic

as well as non-semantic tasks (for a model of LSA-type

semantic structures applied to free recall tasks, see

Polyn, Norman, & Kahana, 2009).

Additionally, it appears that retrieval-based models

currently lack a complete account of how long-term

sequential dependencies, sentential context, and

multimodal information might simultaneously

influence the computation of meaning. For example,

how does multimodal information about an object get

stored in retrieval-based models – does each individual

sensorimotor encounter also leave its own trace in

memory and contribute to the “context-specific”

representation or is the scope of “context” limited to

patterns of co-occurrence? Further, it remains unclear

how representations derived from retrieval-based

models differ from representations derived from

modern RNNs and attention-based NNs, that also

propose contextualized representations. It appears that

these classes of models share similarities in their

fundamental claim that the retrieval context determines

the representation of a concept or word, although

retrieval-based models do not subscribe to any

particular learning mechanism (with the exception of

Howard et al.’ s predictive pTCM model), whereas

RNNs and attention-NNs are based on error-driven

learning mechanisms. Specifically, RNNs and

attention-NNs learn via prediction and incrementally

build semantic representations, whereas retrieval-

based models instead propose that representations are

constructed solely at the time of retrieval, without any

learning occurring at the time of exposure or encoding.

Furthermore, while RNNs and attention-NNs take

word order and positional information (e.g.,

bidirectionality in BERT) into account within their

definition of “context” when constructing semantic

representations, it appears that recent retrieval-based

models currently lack mechanisms to incorporate word

order into their representations (e.g., Jamieson et al.,

2018), even though this may simply be a practical

limitation at this point. Finally, it is unclear how

retrieval-based models would scale up to sentences,

paragraphs and other higher-order structures like

events, issues that are being successfully addressed by

other learning-based DSMs (see Sections III and IV).

Clearly, more research is needed to adequately assess

the relative performance of retrieval-based models,

compared to state-of-the-art learning-based models of

semantic memory currently being widely applied in the

literature to a large collection of semantic (and non-

semantic) tasks. Collectively, it seems most likely that

humans store individual exemplars in some form (e.g.,

a distributed pattern of activation) or at least to some

extent (e.g., storing only traces above a certain

threshold of stable activation), but also learn a

prototypical representation as consistent exemplars are

experienced, which facilitates faster top-down

processing (for a similar argument, see Yee et al., 2018)

in cognitive tasks, although this issue clearly needs to

be explored further.

The central idea that emerged in this section is that

semantic memory representations may indeed vary

across contexts. The accumulating evidence that

meaning rapidly changes with linguistic context

certainly necessitates models that can incorporate this

flexibility into word representations. Attention-based

NNs like BERT and GPT-2/3 represent a promising

step towards constructing such contextualized,

attention-based representations and appear to be

consistent with the automatic and attentional

components of language processing (Neely, 1977),

although more work is needed to clarify how these

models compute meaningful representations that can

be flexibly applied across different tasks. The success

of attention-based NNs is truly impressive on one hand

but also cause for concern on the other. First, it is

remarkable that the underlying mechanisms proposed

by these models at least appear to be psychologically

intuitive and consistent with empirical work showing

that attentional processes and predictive signals do

indeed contribute to semantic task performance (e.g.,

Nozari et al., 2016). However, if the ultimate goal is to

build models that explain and mirror human cognition,

the issues of scale and complexity cannot be ignored.

Current state-of-the-art models operate at a scale of

word exposure that is much larger than what young

adults are typically exposed to (De Deyne, Perfors, &

Storms, 2016; Lake, Ullman, Tenenbaum, &

Gershman, 2017). Therefore, exactly how humans

perform the same semantic tasks without the large

amounts of data available to these models remains

unknown. One line of reasoning is that while humans

have lesser linguistic input compared to the corpora

that modern semantic models are trained on, humans

instead have access to a plethora of non-linguistic

sensory and environmental input, which is likely

contributing to their semantic representations. Indeed,

the following section discusses how conceptualizing

semantic memory as a multimodal system sensitive to

perceptual input represents the next big paradigm shift

in the study of semantic memory.

4 Grounding Models of Semantic

Memory

Vir tua lly all dis tri but ion al and netw ork -based semantic

models rely on large text corpora or databases to

construct semantic representations. Consequently, a

consistent and powerful criticism of distributional

semantic models comes from the grounded cognition

movement (Barsalou, 2016), which rejects the idea that

meaning can be represented through abstract and

amodal symbols like words in a language. Instead,

grounded cognition researchers posit that sensorimotor

modalities, the environment, and the body all

contribute and play a functional role in cognitive

processing, and by extension, the construction of

meaning. Grounded (or embodied) cognition is a rather

broad enterprise that attempts to redefine the study of

cognition (Matheson & Barsalou, 2018). Within the

domain of semantic memory, distributional models in

particular have been criticized because they derive

semantic representations from only linguistic texts and

are not grounded in perception and action, leading to

the symbol grounding problem (Harnad, 1990; Searle,

1980), i.e., how can the meaning of a word (e.g., an

ostrich) be grounded only in other words (e.g., big,

bird, etc.) that are further grounded in more words?

While there is no one theory of grounded cognition

(Matheson & Barsalou, 2018), the central tenet

common to several of them is that the body, brain, and

physical environment dynamically interact to produce

meaning and cognitive behavior. For example, based

on Barsalou’s account (Barsalou, 1999; 2003; 2008),

when an individual first encounters an object or

experience (e.g., a knife), it is stored in the modalities

(e.g., its shape in the visual modality, its sharpness in

the tactile modality etc.) and the sensorimotor system

(e.g., how it is used as a weapon or kitchen utensil).

Repeated co-occurrences of physical stimulations

result in functional associations (likely mediated by

associative Hebbian learning and/or connectionist

mechanisms) that form a multimodal representation of

the object or experience (Matheson & Barsalou, 2018).

Features of these representations are activated through

recurrent connections, which produces a simulation of

past experiences. These simulations not only guide an

individual’s ongoing behavior retroactively (e.g., how

to dice onions with a knife), but also proactively

influence their future or imagined plans of action (e.g.,

how one might use a knife in a fight). Simulations are

assumed to be neither conscious nor complete

(Barsalou, 2005; Barsalou et al., 2003), and are

sensitive to cognitive and social contexts (Lebois,

Wilson-Mendenhall, & Barsalou, 2015).

There is some empirical support for the grounded

cognition perspective from sensorimotor priming

studies. In particular, there is substantial evidence that

modality-specific neural information is activated

during language processing tasks. For example, it has

been demonstrated that reading verbs like kick

(corresponding to feet) or pick (corresponding to hand)

activates the motor cortex in a somatotopic fashion

(Pulvermüller, 2005), passive reading of taste-related

words (e.g., salt) activates gustatory cortices (Barros-

Loscertales et al., 2011), and verifying modality-

specific properties of words (e.g., color, taste, sound

and touch) activates the corresponding sensory brain

regions (Goldberg, Perfetti, & Schneider, 2006).

However, whether the activation of modality-specific

information is incidental to the task and simply a result

of post-representation processes, or actually part of the

semantic representation itself is an important question.

Support for the latter argument comes from studies

showing that transcranial stimulation of areas in the

premotor cortex related to the hand facilitates lexical

decision performance for hand-related action words

(Willems et al., 2011), Parkinson’s patients show

selective impairment in comprehending motor action

words (Fernandino et al., 2013), and damage to brain

regions supporting object-related action can hinder

access to knowledge about how objects are

manipulated (Yee, Chrysikou, Hoffman, & Thompson-

Schill, 2013). Yee et al. also showed that when

individuals performed a concurrent manual task while

naming pictures, there was more naming interference

for objects that are more manually used (e.g., pencils),

compared to objects that are not typically manually

used (e.g., tigers). Furthermore, Yee, Huffstetler, and

Thompson-Schill (2011) used a visual eye-tracking

paradigm to show that as an object unfolds over time

(e.g., auditorily hearing frisbee), particular features

(e.g., form-related) come online in a temporally

constrained fashion and can influence eye fixation

times for related words (e.g., e.g., participants fixated

longer on pizza, because frisbee and pizza are both

round). Taken together, these findings suggest that

semantic memory representations are accessed in a

dynamic way during tasks and different perceptual

features of these representations may be accessed at

different timepoints, suggesting a more flexible and

fluid conceptualization of semantic memory that can

change as a function of task. Therefore, it is important

to evaluate whether computational models of semantic

memory can indeed encode these rich, non-linguistic

features as part of their representations.

It is important to note here that while the

sensorimotor studies discussed above provide support

for the grounded cognition argument, these studies are

often limited in scope to processing sensorimotor

words and do not make specific predictions about the

direction of effects (Matheson & Barsalou, 2018;

Matheson et al., 2015). For example, although several

studies show that modality-specific information is

activated during behavioral tasks, it remains unclear

whether this activation leads to facilitation or inhibition

within a cognitive task. Indeed, both types of findings

are taken to support the grounded cognition view,

therefore leading to a lack of specificity in predictions

regarding the role of modality-specific information

(Matheson et al., 2015), although some recent work has

proposed that timing of activation may be critical in

determining how modality-specific activation

influences cognitive performance (Matheson &

Barsalou, 2018). Another strong critique of the

grounded cognition view is that it has difficulties

accounting for how abstract concepts (e.g., love,

freedom etc.) that do not have any grounding in

perceptual experience are acquired or can possibly be

simulated (Dove, 2011). Some researchers have

attempted to “ground” abstract concepts in metaphors

(Lakoff & Johnson, 1999), emotional or internal states

(Vigliocco et al., 2013), or temporally distributed

events and situations (Barsalou & Wiemer-Hastings,

2005), but the mechanistic account for the acquisition

of abstract concepts is still an active area of research.

Finally, there is a dearth of formal models that provide

specific mechanisms by which features acquired by the

sensorimotor system might be combined into a

coherent concept. Some accounts suggest that semantic

representations may be created by patterns of

synchronized neural activity, which may represent

different sensorimotor information (Schneider,

Debener, Oostenveld, & Engel, 2008). Other work has

suggested that certain regions of the cortex may serve

as “hubs” or “convergence zones” that combine

features into coherent representations (Patterson,

Nestor, & Rogers, 2007), and may reflect temporally

synchronous activity within areas to which the features

belong (Damasio, 1989). However, comparisons of

such approaches to DSMs remain limited due to the

lack of formal grounded models, although there have

been some recent attempts at modeling perceptual

schemas (Pezzulo & Calvi, 2010) and Hebbian learning

(Garagnani & Pulvermüller, 2016).

Proponents of the grounded cognition view have

also presented empirical (Glenberg & Robertson, 2000;

Rubinstein, Levi, Schwartz, & Rappoport, 2015) and

theoretical criticisms (Barsalou, 2003; Perfetti, 1998)

of DSMs over the years. For example, Glenberg and

Robertson (2000) reported three experiments to argue

that high-dimensional space models like LSA/HAL are

inadequate theories of meaning, because they fail to

distinguish between sensible (e.g., filling an old

sweater with leaves) and nonsensical sentences (e.g.,

filling an old sweater with water) based on cosine

similarity between words (but see Burgess, 2000).

Some recent work also shows that traditional DSMs

trained solely on linguistic corpora do indeed lack

salient features and attributes of concepts. Baroni and

Lenci (2008) compared a model analogous to LSA with

attributes derived from McRae et al. (2005) and an

image-based dataset. They provided evidence that

DSMs entirely miss external (e.g., a car <has wheels>)

and surface level (e.g., a banana <is yellow>)

properties of objects, and instead focus on taxonomic

(e.g., cat-dog) and situational relations (e.g., spoon-

bowl), which are more frequently encountered in

natural language. More recently, Rubinstein et al.

(2016) evaluated four computational models, including

word2vec and GloVE and showed that DSMs are poor

at classifying attributive properties (e.g., an elephant

<is large>), but relatively good at classifying

taxonomic properties (e.g., apple <is a> fruit)

identified by human subjects in a property generation

task (also see Collell & Moens, 2016; Lucy & Gauthier,

2017).

Collectively, these studies appear to underscore

the intuitions of the grounded cognition researchers

that semantic models based solely on linguistic sources

do not produce sufficiently rich representations. While

this is true, it is important to realize here that the failure

of DSMs to encode these perceptual features is a

function of the training corpora they are exposed to,

i.e., a practical limitation, and not necessarily a

theoretical one. Early DSMs were trained on linguistic

corpora not because it was intrinsic to the theoretical

assumptions made by the models, but because text

corpora were easily available (for more fleshed-out

arguments on this issue, see Burgess, 2000; Günther et

al., 2019; Landauer & Dumais, 1997). Therefore, the

more important question is whether DSMs can be

adequately trained to derive statistical regularities from

other sources of information (e.g., visual, haptic,

auditory etc.), and whether such DSMs can effectively

incorporate these signals to construct “grounded”

semantic representations.

4.1 Grounding DSMs through Feature Inte-

gration

The lack of grounding in standard DSMs led to a

resurging interest in early feature-based models

(McRae et al., 1997; Smith, Shoben, & Rips, 1974). As

discussed earlier, early feature-based models

represented words as a collection of binary features

(e.g., birds have wings, whereas cars do not), and

words with similar meanings had greater overlap in

their constituent features (McCloskey & Glucksberg,

1979; Smith, Shoben, & Rips, 1974; Tversky, 1975),

although these early models did not have explicit

mechanisms to account for how features were learned

in the first place. However, one important strength of

feature-based models was that the features encoded

could directly be interpreted as placeholders for

grounded sensorimotor experiences (Baroni & Lenci,

2008). For example, the representation of a banana is

distributed across several hundred dimensions in a

distributional approach and these dimensions may or

may not be interpretable (Jones, Willits, & Dennis,

2015), but the perceptual experience of the banana’s

color being yellow can be directly encoded in feature-

based models (e.g., banana <is yellow>). However, it

is important to note here that again, the fact that

features can be verbalized and are more interpretable

compared to dimensions in a DSM is a result of the

features having been extracted from property

generation norms, compared to textual corpora.

Therefore, it is possible that some of the information

captured by property generation norms may already be

encoded in DSMs, albeit through less interpretable

dimensions. Indeed, a systematic comparison of

feature-based and distributional models by Riordan and

Jones (2011) demonstrated that representations derived

from DSMs produced comparable categorical structure

to feature representations generated by humans, and the

type of information encoded by both types of models

was highly correlated but also complementary. For

example, DSMs gave more weight to actions and

situations (e.g., eat, fly, swim) that are frequently

encountered in the linguistic environment, whereas

feature-based representations were better are capturing

object-specific features (e.g., <is yellow>, <made of

metal>), that potentially reflected early sensorimotor

experiences with objects. Riordan and Jones argued

that children may be more likely to initially extract

information from sensorimotor experiences. However,

as they acquire more linguistic experience, they may

shift to extracting the redundant information from the

distributional structure of language and rely on

perception for only novel concepts or the unique

sources of information it provides. This idea is

consistent with the symbol interdependency hypothesis

(Louwerse, 2011), which proposes that while words

must be grounded in the sensorimotor action and

perception, they also maintain rich connections with

each other at the symbolic level, which allows for more

efficient language processing by making it possible to

skip grounded simulations when unnecessary. The

notion that both sources of information are critical to

the construction of meaning presents a promising

approach to reconciling distributional models with the

grounded cognition view of language (for similar

accounts, see Barsalou, Santos, Kyle, & Wilson, 2008;

Paivio, 1991).

Recent work in computational modeling has

attempted to integrate featural information with

distributional information to enrich semantic

representations. For example, Andrews, Vigliocco, and

Vin son (200 9) use d a Bay esi an pro bab ili sti c top ic

model to jointly model semantic representations using

experiential feature-based (e.g., an ostrich <is big>,

<does not fly>, <has feathers> etc.) and linguistic (e.g.,

ostrich and emu co-occur) data as complementary

sources of information. Further, Vigliocco, Meteyard,

Andrews, and Kousta (2009) argued that affective and

internal states can serve as another data source that

could potentially enrich semantic representations,

particularly for abstract concepts that lack

sensorimotor associations (Kousta, Vigliocco, Vinson,

Andrews, & Del Campo, 2011). The information

integration approach has also been applied to other

types of DSMs. For example, Jones and Recchia (2010)

integrated feature-based information with BEAGLE to

show that temporal linguistic information plays a

critical role in generating accurate semantic

representations. Johns and Jones (2012) have also

explored the integration of perceptual information with

linguistic information based on simple associative

mechanisms, borrowing principles from Hintzman’s

(1988) MINERVA architecture and Kwantes’ (2005)

model. Their model provided a proof of concept that

perceptually rich semantic representations may be

constructed by grounding them in already formed or

learned representations of other words (accessible via

feature norms). This notion of grounding

representations in previously learned words has also

been explored by Howell, Jankowicz, and Becker

(2005) using a recurrent NN model. Using a modified

version of the Elman’s (1990) SRN with two additional

output layers for noun and verb features, Howell et al.

trained the model to map phonetically presented input

words (nouns) to semantic features and perform a

grammatical word prediction task. Howell et al. argued

that this type of learning mechanism could be applied

to simulate a “propagation of grounding” effect, where

sensorimotor information from early, concrete words

acquired by children feeds into semantic

representations of novel words, although this proposal

was not formally tested in the paper. Other work on

integrating featural information has explored training a

recurrent NN model with sensorimotor feature inputs

and patterns of co-occurrence to account for a wide

variety of behavioral patterns consistent with normal

and impaired semantic cognition (Hoffman,

McClelland, & Lambon Ralph, 2018), implementing a

feedforward NN to apply feature learning to a simple

word-word co-occurrence model (Durda, Buchanan, &

Caron, 2009) and using feature-based vectors as input

to a random-vector accumulation model (Vigliocco,

Vin son , L ewi s, & G arr ett , 2 004 ).

4.2 Multimodal DSMs

Despite their considerable success, an important

limitation of feature-integrated distributional models is

that the perceptual features available are often

restricted to small datasets (e.g., 541 concrete nouns

from McRae et al., 2005), although some recent work

has attempted to collect a larger dataset of feature

norms (e.g., 4436 concepts; Buchanan, Valentine, &

Maxwell, 2019). Moreover, the features produced in

property generation tasks are potentially prone to

saliency biases (e.g., hardly any participant will

produce the feature <has a head> for a dog because

having a head is not salient or distinctive), and thus can

only serve as an incomplete proxy for all the features

encoded by the brain. To address these concerns, Bruni

et al. (2014) applied advanced computer vision

techniques to automatically extract visual and

linguistic features from multimodal corpora to

construct multimodal distributional semantic

representations. Using a technique called “bag-of-

visual-words” (Sivic & Zisserman, 2003), the model

Figure 8. A depiction of a typical convolutional neural network that detects vertical edges in an image. A sliding filter is

multiplied with the pixelized image to produce a matrix, and then a pooling step combines results from the convolved

output into a smaller matrix by selecting the maximum value from each 2x2 sub-matrix in the convolved matrix. This

final 2x2 matrix represents the final representation of the image highlighting the vertical edges.

discretized visual images and produced visual units

comparable to words in a text document. The resulting

image matrix was then concatenated with a textual

matrix constructed from a natural language corpus

using singular value decomposition to yield a

multimodal semantic representation. Bruni et al.

showed that this model was superior to a purely text-

based approach and successfully predicted semantic

relations between related words (e.g., ostrich-emu),

clustering of words into superordinate concepts (e.g.,

ostrich-bird), and predicting human relatedness

judgments.

This multimodal approach to semantic

representation is currently a thriving area of research

(Feng & Lapata, 2010; Kiela & Bottou, 2014;

Lazaridou, Pham, & Baroni, 2015; Silberer & Lapata,

2012; 2014). Advances in the machine learning

community have majorly contributed to accelerating

the development of these models. In particular,

Convolutional Neural Networks (CNNs) were

introduced as a powerful and robust approach for

automatically extracting meaningful information from

images, visual scenes, and longer text sequences. The

central idea behind CNNs is to apply a non-linear

function (a “filter”) to a sliding window of the full

chunk of information, e.g., pixels in an image, words in

a sentence, etc. The filter transforms the larger window

of information into a fixed d-dimensional vector, that

captures the important properties of the pixels or words

in that window. Convolution is followed by a “pooling”

step, where vectors from different windows are

combined into a single d-dimensional vector, by taking

the maximum or average value of each of the d-

dimensions across the windows. This process extracts

the most important features from a larger set of pixels

(see Figure 8), or the most informative k-grams in a

long sentence. CNNs have been flexibly applied to

different semantic tasks like sentiment analysis and

machine translation (Collobert et al., 2011;

Kalchbrenner et al., 2014), and are currently being used

to develop multimodal semantic models.

Kiela and Bottou (2014) applied CNNs to extract

the most meaningful features from images from a large

image database (ImageNet; Deng et al., 2009) and then

concatenated these image vectors with linguistic

word2vec vectors to produce superior semantic

representations compared to Bruni et al. (2014; also see

Silberer & Lapata, 2014). Lazaridou et al. (2015)

constructed a multimodal word2vec model that was

trained to jointly learn visual and semantic

representations for a subset of words (using image-

based CNNs and word2vec), and this learning was then

generalized to the entire corpus, thus echoing Howell

et al.’s (2011) intuitions of “propagation of grounding”.

Lazaridou et al. also demonstrated how the learning of

abstract words might be grounded in concrete scenes

(e.g., freedom might be the inferred concept from a

scene of a person raising their hands in a protest), an

intuitively powerful proposal that can potentially

demystify the acquisition of abstract concepts but

clearly needs further exploration.

There is also some work within the domain of

associative network models of semantic memory that

has focused on integrating different sources of

information to construct the semantic networks. One

particular line of research has investigated combining

word association norms with featural information, co-

occurrence information, and phonological similarity to

form multiplex networks (Stella, Beckage, & Brede,

2017; Stella, Beckage, Brede, & Dominico, 2018).

Stella et al. (2017) demonstrated that the “layers” in

such a multiplex network differentially influence

language acquisition, with all layers contributing

equally initially but the association layer overtaking the

word learning process with time. This proposal is

similar to the ideas presented earlier regarding how

perceptual or sensorimotor experience might be

important for grounding words acquired earlier, and

words acquired later might benefit from and derive

their representations through semantic associations

with these early experiences (Howell et al., 2005;

Riordan & Jones, 2011). In this sense, one can think of

phonological information and featural information

providing the necessary grounding to early acquired

concepts. This “grounding” then propagates and

enriches semantic associations, which are easier to

access as the vocabulary size increases and individuals

develop more complex semantic representations.

4.3 Summary

Given the success of integrated and multimodal DSMs

memory that use state-of-the-art modeling techniques

to incorporate other modalities to augment linguistic

representations, it appears that the claim that semantic

models are “amodal” and “ungrounded” may need to

be revisited. Indeed, the fact that multimodal semantic

models can adequately encode perceptual features

(Bruni et al., 2014; Kiela & Bottou, 2014) and can

approximate human judgments of taxonomic and

visual similarity (Lazaridou et al., 2015), suggests that

the limitations of previous models (e.g., LSA, HAL

etc.) were more practical than theoretical. Of course,

incorporating other modalities besides vision is critical

to this enterprise, and although there have been some

efforts to integrate sound and olfactory data into

semantic representations (Kiela, Bulat, & Clark, 2015;

Kiela & Clark, 2015; Lopopolo & Miltenburg, 2015),

these approaches are limited by the availability of large

datasets that capture other aspects of embodiment that

may be critical for meaning construction, such as

touch, emotion, and taste. Investing resources in

collecting and archiving multimodal datasets (e.g.,

video data) is an important next step for advancing

research in semantic modeling and broadening our

understanding of the many facets that contribute to the

construction of meaning.

5 Compositional Semantic Representa-

tions

An additional aspect of extending our understanding of

meaning by incorporating other sources of information

is that meaning may be situated within and as part of

higher-order semantic structures like sentence models,

event models, or schemas. Indeed, language is

inherently compositional, in that morphemes combine

to form words, words combine to form phrases, and

phrases combine to form sentences. Moreover,

behavioral evidence from sentential priming studies

indicates that the meaning of words depends on

complex syntactic relations (Morris, 1994). Further, it

is well known that the meaning of a sentence itself is

not merely the sum of the words it contains. For

example, the sentence “John loves Mary” has a

different meaning than “Mary loves John”, despite both

sentences having the same words. Thus, it is important

to consider how compositionality can be incorporated

into and inform existing models of semantic memory.

5.1 Compositional Linguistic Approaches

Associative network models do not have any explicit

way of modeling compositionality, as they propose

representations at the word level that cannot be

straightforwardly scaled to higher-order semantic

structures. On the other hand, distributional models

have attempted to build compositionality into semantic

representations by assigning roles to different entities

in sentences (e.g., in “Mary loves John”, Mary is the

lover and John is the lovee; Dennis, 2004; 2005),

treating frequent phrases as single units and deriving

phrase-based representations (e.g., treating proper

names like New York as a single unit; Bannard,

Baldwin, & Lascarides, 2003; Mikolov et al., 2013b)

or forming pair-pattern matrices (e.g., encoding words

that fulfil the pattern X cuts Y, i.e., mason: stone;

Turney & Pantel, 2010). However, these approaches

were either not scalable for longer phrases or lacked the

ability to model constituent parts separately (Mitchell

& Lapata, 2010). Vector addition (or averaging) is

another common method of combining distributional

semantic representations for different words to form

higher-order vectors (Landauer & Dumais, 1997), but

this method is insensitive to word order and syntax and

produces a blend that does not appropriately extract

meaningful information from the constituent words

(Mitchell & Lapata, 2010).

An alternative method of combining word-level

vectors is through a matrix multiplication technique

called tensor products. Tensor products are a way of

computing pairwise products of the component word

vector elements (Clark, Coecke, & Sadrzadeh, 2008;

Clark & Pulman, 2007; Widdows, 2008), but this

approach suffers from the curse of dimensionality, i.e.,

the resulting product matrix becomes very large as

more individual vectors are combined. Circular

convolution is a special case of tensor products that

compresses the resulting product of individual word

vectors into the same dimensionality (e.g., Jones &

Mewhort, 2007). In a systematic review, Mitchell and

Lapata (2010) examined several compositional

functions applied onto a simple high-dimensional

space model and a topic model space in a phrase

similarity rating task (judging similarity for phrases

like vast amount-large amount, start work-begin

career, good place-high point etc.). Specifically, they

examined how different methods of combining word-

level vectors (e.g., addition, multiplication, pairwise

multiplication using tensor products, circular

convolution etc.) compared in their ability to explain

performance in the phrase similarity task. Their

findings indicated that dilation (a function that

amplified some dimensions of a word when combined

with another word, by differentially weighting the

vector products between the two words) performed

consistently well in both spaces, and circular

convolution was the least successful in judging phrase

similarity. This work sheds light on how simple

compositional operations (like tensor products or

circular convolution) may not sufficiently mimic

human behavior in compositional tasks and may

require modeling more complex interactions between

words (i.e., functions that emphasize different aspects

of a word).

Recent efforts in the machine learning community

have also attempted to tackle semantic

compositionality using Recursive NNs. Recursive NNs

represent a generalization of recurrent NNs that, given

a syntactic parse-tree representation of a sentence, can

generate hierarchical tree-like semantic representations

by combining individual words in a recursive manner

(conditional on how probable the composition would

be). For example, Socher et al. (2012) proposed a

recursive NN to compute compositional meaning

representations. In their model, each word is assigned

a vector that captures its meaning and also a matrix that

contains information about how it modifies the

meaning of another word. This representation for each

word is then recursively combined with other words

using a non-linear composition function (an extension

of work by Mitchell & Lapata, 2010). For example, in

the first iteration, the words very and good may be

combined into a representation (e.g., very good), which

would recursively be combined with movie to produce

the final representation (e.g., very good movie). Socher

et al. showed that this model successfully learned

propositional logic, how adverbs and adjectives

modified nouns, sentiment classification, and complex

semantic relationships (also see Socher et al., 2013).

Other work in this area has explored multiplication-

based models (Yessenalina & Cardie, 2011), LSTM

models (Zhu et al., 2016), and paraphrase-supervised

models (Saluja, Dyer & Ruvini, 2018). Collectively,

this research indicates that modeling the sentence

structure through NN models and recursively applying

composition functions can indeed produce

compositional semantic representations that are

achieving state-of-the-art performance in some

semantic tasks.

5.2 Compositional Event Representations

Another critical aspect of modeling compositionality is

being able to extend representations at the word or

sentence level to higher-level cognitive structures like

events or situations. The notion of schemas as a higher-

level, structured representation of knowledge has been

shown to guide language comprehension (Schank &

Abelson, 1977; for reviews, see Rumelhart, 1991) and

event memory (Bower, Black, & Turner, 1979; Hard,

Tversky, & Lang, 2006). The past few years have seen

promising advances in the field of event cognition

(Elman & McRae, 2019; Franklin et al., 2019;

Reynolds, Zacks, & Braver, 2007; Schapiro et al.,

2013). Importantly, while most event-based accounts

have been conceptual, recent computational models

have attempted to explicitly specify processes that

might govern event knowledge. For example, Elman

and McRae (2019) recently proposed a recurrent NN

model of event knowledge, trained on activity

sequences that make up events. An activity was defined

as a collection of agents, patients, actions, instruments,

states, and contexts, each of which were supplied as

inputs to the network. The task of the network was to

learn the internal structure of an activity (i.e., which

features correlate with a particular activity) and also

predict the next activity in sequence. Elman and McRae

showed that this network was able to infer the co-

occurrence dynamics of activities, and also predict

sequential activity sequences for new events. For

example, when presented with the activity sequence,

“The crowd looks around. The skater goes to the

podium. The audience applauds. The skater receives a

___”, the network activated the words podium and

medal after the fourth sentence (“the audience

applauds”) because both of these are contextually

appropriate (receiving an award at the podium and

receiving a medal), although medal was more activated

than podium as it was more appropriate within that

context. This behavior of the model was strikingly

consistent with N400 amplitudes observed for the same

types of sentences in an ERP study (Metusalem et al.,

2012), indicating that the model was able to make

predictive inferences like human participants.

Franklin et al. (2019) recently proposed a

probabilistic model of event cognition. In their model,

each visual scene had a distributed vector

representation, encoding the features that are relevant

to the scene, which were learned using an unsupervised

CNN. Additionally, scenes contained relational

information that linked specific roles to specific fillers

via circular convolution. A four-layer fully connected

NN with Gated Recurrent Units (GRUs; a type of

recurrent NN) was then trained to predict successive

scenes in the model. Using the Chinese Restaurant

Process, at each timepoint, the model evaluated its

prediction error to decide if its current event

representation was still a good fit. If the prediction

error was high, the model chose whether it should

switch to a different previously-learned event

representation or create an entirely new event

representation, by tuning parameters to evaluate total

number of events and event durations. Franklin et al.

showed that their model successfully learned complex

event dynamics and simulated a wide variety of

empirical phenomena. For example, the model’s ability

to predict event boundaries from unannotated video

data (Zacks et al., 2011) of a person completing

everyday tasks like washing dishes, was highly

correlated with grouped participant data and also

produced similar levels of prediction error across event

boundaries as human participants.

5.3 Summary

This section reviewed some early and recent work at

modeling compositionality, by building higher-level

representations like sentences and events, through

lower-level units like words or discrete time points in

video data. One important limitation of the event

models described above is that they are not models of

semantic memory per se, in that they neither contain

rich semantic representations as input (Franklin et al.,

2019), nor do they explicitly model how linguistic or

perceptual input might be integrated to learn concepts

(Elman & McRae, 2019). Therefore, while there have

been advances in modeling word and sentence-level

semantic representations (Sections I and II), and at the

same time, there has been work on modeling how

individuals experience events (Section IV), there

appears to be a gap in the literature as far as integrating

word-level semantic structures with event-level

representations is concerned. Given the advances in

language modeling discussed in this review, the

integration of structured semantic knowledge (e.g.,

recursive NNs), multimodal semantic models, and

models of event knowledge discussed in this review

represents a promising avenue for future research that

would enhance our understanding of how semantic

memory is organized to represent higher-level

knowledge structures. Another promising line of

research in the direction of bridging this gap comes

from the artificial intelligence literature, where neural

network agents are being trained to learn language in a

simulated grid world full of perceptual and linguistic

information (Bahdanau et al., 2019; Hermann et al.,

2017) using reinforcement learning principles. Indeed,

McClelland, Hill, Rudolph, Baldridge, and Schütze

(2019) recently advocated the need to situate language

within a larger cognitive system. Conceptualizing

semantic memory as part of a broader integrated

memory system consisting of objects, situations, and

the social world is certainly important for the success

of the semantic modeling enterprise.

6 Open Issues and Future Directions

The question of how concepts are represented, stored,

and retrieved is fundamental to the study of all

Tab le 1. Modern computational models of semantic memory.

Group

Model Class

Representative Papers

Input

Mechanism

Network-based

Association networks

De Deyne & Storms, 2008; Kenett et al.,

2011; Steyvers & Tenenbaum, 2005

Free association norms

Wor ds c onn ec te d

by edges

(semantic

relationships)

Multiplex networks

Stella et al., 2017; 2018

Free association norms,

features, co-occurrence,

phonology

Integrate different

sources to

produce multi-

level network

Feature-based

Feature-integrated

models

Andrews, Vigliocco, & Vinson (2009);

Jones & Recchia (2010); Howell et al.

(2005)

Explicitly coded features

& words

Overlap of

features

determines

semantic overlap

between words

Distributional

Semantic Models

(DSMs)

Error-free (Hebbian)

Learning-based

DSMs

Jones & Mewhort, 2007 (BEAGLE);

Landauer & Dumais, 1997 (LSA); Lund &

Burgess, 1996 (HAL)

Wor ds i n a t ext c or pus

Co-occurrence

matrix, often

transformed by

SVD or random

vector

accumulation

Error-driven

Learning-based

(Predictive) DSMs

Neural embedding models (Mikolov et al.,

2013 (word2vec), Bojanowski et al., 2017

(fastText))

Wor ds i n a t ext c or pus

Train neura l

network (NN)-

based word

vectors to perform

semantic task

Top ic Mo del s ( Bl ei, Ng , Jor dan (2 00 3);

Griffiths, Steyvers, & Tenenbaum, 2007)

Wor ds i n a text corpus

Wor d -by-

document

matrix, with

dimensionality

reduction

Convolutional Neural Networks (CNNs;

Collobert et al., 2011; Kalchbrenner et al.,

2014)

Natural images, scenes,

or text

Extract features

from input using

matrix operations

and NNs

Recurrent NNs (e.g., LSTMs, GRUs,

Recursive NNs; Peters et al., 2018 (ELMo);

Socher et al., 2013)

Wor d se que nc es or

syntactic parse trees

Store previous

state of NN model

as recurrent

connections to

inform future

predictions

Attention-NNs (e.g., Bahdanau et al., 2014;

Va sw a n i e t a l . , 2 0 18 ; R a d f or d e t al . , 2 0 1 9

(GPT-2); Brown et al., 2020 (GPT-3);

Devlin et al., 2019 (BERT))

Wor ds i n a t ext c or pus

Use attention

“heads” or vectors

in NNs to focus

on different parts

of input while

minimizing error

Retrieval-based

Jamieson et al., 2019;

Kwantes, 2005;

Wor ds i n a t ext c or pus

Store each

individual word

occurrence and

perform

abstraction-at-

retrieval cued by a

probe

Hybrid DSMs

GloVe (Pennington et al., 2014)

Wor ds i n a t ext c or pus

Uses global co-

occurrence and

prediction

pTCM (Howard et al., 2011)

Wor ds i n a t ext c or pus

Uses retrieval-

based operations

and prediction

vectors

Multimodal DSMs (e.g., Kiela & Bottou,

2014; Lazaridou et al., 2015)

Wor ds a nd im ag es

Combine image

and textual

embeddings

cognition. Over the past few decades, advances in the

fields of psychology, computational linguistics, and

computer science have truly transformed the study of

semantic memory. This paper reviewed classic and

modern models of semantic memory that have

attempted to provide explicit accounts of how semantic

knowledge may be acquired, maintained and used in

cognitive tasks to guide behavior. Table 1 presents a

short summary of the different types of models

discussed in this review, along with their basic

underlying mechanisms. In this concluding section,

some open questions and potential avenues for future

research in the field of semantic modeling will be

discussed.

6.1 Data Availability and Abundance

Within the context of semantic modeling, data is a

double-edged sword. On one hand, the availability of

training data in the form of large text corpora such as

Wikipedia articles, Google News corpora, etc. has led

to an explosion of models such as word2vec (Mikolov

et al., 2013a), fastText (Bojanowski et al., 2017),

GLoVe (Pennington et al., 2014), and ELMo (Peters et

al., 2014) that have outperformed several standard

models of semantic memory traditionally trained on

lesser data. Additionally, with the advent of

computational resources to quickly process even larger

volumes of data using parallel computing, models such

as BERT (Devlin et al., 2019), GPT-2 (Radford et al.,

2019), and GPT-3 (Brown et al., 2020) are achieving

unprecedented success in language tasks like question

answering, reading comprehension, and language

generation. At the same time, however, criticisms of

ungrounded distributional models have led to the

emergence of a new class of “grounded” distributional

models. These models automatically derive non-

linguistic information from other modalities like vision

and speech using convolutional neural networks

(CNNs) to construct richer representations of concepts.

Even so, these grounded models are limited by the

availability of multimodal sources of data, and

consequently there have been recent efforts at

advocating the need for constructing larger databases

of multimodal data (Günther et al., 2019).

On the other hand, training models on more data is

only part of the solution. As discussed earlier, if models

trained on several gigabytes of data perform as well as

young adults who were exposed to far fewer training

examples, it tells us little about human language and

cognition. The field currently lacks systematic

accounts for how humans can flexible use language in

different ways with the impoverished data they are

exposed to. For example, children can generalize their

knowledge of concepts fairly easily from relatively

sparse data when learning language, and only require a

few examples of a concept before they understand its

meaning (Carey & Bartlett, 1978; Landau, Smith, &

Jones, 1988; Xu & Tenenbaum, 2007). Furthermore,

both children and young adults can rapidly learn new

information from a single training example, a

phenomenon referred to as one-shot learning. To

address this particular challenge, several researchers

are now building models than can exhibit few-shot

learning, i.e., learning concepts from only a few

examples, or zero-shot learning, i.e., generalizing

already acquired information to never-seen before data.

Some of these approaches utilize pretrained models

like GPT-2 and GPT-3 trained on very large datasets

and generalizing their architecture to new tasks (Brown

et al., 2020; Radford et al., 2019). While this approach

is promising, it appears to be circular because it still

uses vast amounts of data to build the initial pretrained

representations. Other work in this area has attempted

to implement one-shot learning using Bayesian

generative principles (Lake, Salakhutdinov, &

Ten en bau m, 2 01 5), and i t re main s to be se en h ow

probabilistic semantic representations account for the

generative and creative nature of human language.

6.2 Errors and Degradation in Language

Processing

Another striking aspect of the human language system

is its tendency to break down and produce errors during

cognitive tasks. Analyzing errors in language tasks

provides important cues about the mechanics of the

language system. Indeed, there is considerable work on

tip-of-the-tongue experiences (James & Burke, 2000;

Kumar, Balota, Habbert, Scaltritti, & Maddox, 2019),

speech errors (Dell, 1990), errors in reading (Clay,

1968), language deficits (Hodges & Patterson, 2007;

Shallice, 1988), and age-related differences in

language tasks (Abrams & Farrell, 2011), to suggest

that the cognitive system is prone to interference,

degradation, and variability. However, computational

accounts for how language may be influenced by

interference or degradation remain limited. Early

connectionist models did provide ways of lesioning the

network to account for neuropsychological deficits

such as dyslexia (Hinton & Shallice, 1991; Plaut &

Shallice, 1993) and category-specific semantic deficits

(Farah & McClelland, 2013), and this general approach

has recently been extended to train a recurrent NN

based on sensorimotor and co-occurrence-based

information and simulate behavioral patterns observed

in patients of semantic dementia and semantic aphasia

(Hoffman, McClelland, & Lambon Ralph, 2018).

However, current state-of-the-art language models like

word2vec, BERT, and GPT-2 or GPT-3 do not provide

explicit accounts for how neuropsychological deficits

may arise, or how systematic speech and reading errors

are produced. Furthermore, while there is considerable

empirical work investigating age-related differences in

language processing tasks (e.g., speech errors, picture

naming performance, lexical retrieval, etc.), it is

unclear how current semantic models would account

for these age-related changes, although some recent

work has compared the semantic network structure

between older and younger adults (Dubossarsky, De

Deyne, & Hills, 2017; Wulff, Hills, & Mata, 2018).

Indeed, the deterministic nature of modern machine

learning models is drastically different from the

stochastic nature of human language that is prone to

errors and variability (Kurach et al., 2019).

Computational accounts of how the language system

produces and recovers from errors will be an important

part of building machine learning models that can

mimic human language.

6.3 Communication, Social Collaboration,

and Evolution

Another important aspect of language learning is that

humans actively learn from each other and through

interactions with their social counterparts, whereas the

majority of computational language models assume

that learners are simply processing incoming

information in a passive manner (Günther et al., 2019).

Indeed, there is now ample evidence to suggest that

language evolved through natural selection for the

purposes of gathering and sharing information (Pinker,

2003, p. 27; DeVore & Tooby, 1987), thereby allowing

for personal experiences and episodic information to be

shared among humans (Corballis, 2017a; 2017b).

Consequently, understanding how artificial and human

learners may communicate and collaborate in complex

tasks is currently an active area of research. For

example, some recent work in natural language

processing has attempted to model interactions and

search processes in collaborative language games, such

as Codenames (Kumar, Steyvers, & Balota, under

review.; Shen, Hofer, Felbo, & Levy, 2018, also see

Kim, Ruzmaykin, Truong, & Summerville, 2019),

Passcode (Xu & Kemp, 2010), and navigational games

(Wang, Liang, & Manning, 2016), and suggested that

speakers and listeners do indeed calibrate their

responses based on feedback from their conversational

partner. Another body of work currently being led by

technology giants like Google and OpenAI is focused

on modeling interactions in multiplayer games like

football (Kurach et al., 2019) and Dota 2 (OpenAI,

2019). This work is primarily based on reinforcement

learning principles, where the goal is to train neural

network agents to interact with their environment and

perform complex tasks (Sutton & Barto, 1998).

Although these research efforts are less language-

focused, deep reinforcement learning models have also

been proposed to specifically investigate language

learning. For example, Li et al. (2016) trained a

conversational agent using reinforcement learning, and

a reward metric based on whether the dialogues

generated by the model were easily answerable,

informative, and coherent. Other learning-based

models have used adversarial training, a method by

which a model is trained to produce responses that

would be indistinguishable from human responses (Li

et al., 2017), a modern version of the Turing test (also

see Spranger, Pauw, Loetzsch, & Steels, 2012).

However, these recent attempts are still focused on

independent learning, whereas psychological and

linguistic research suggests that language evolved for

purposes of sharing information, which likely has

implications for how language is learned in the first

place. Clearly, this line of work is currently in its

nascent stages and requires additional research to fully

understand and model the role of communication and

collaboration in developing semantic knowledge.

6.4 Multilingual Semantic Models

A co mputa tional model can only be considered a m odel

of semantic memory if it can be broadly applied to any

semantic memory system and does not depend on the

specific language of training. Therefore, an important

challenge for computational semantic models is to be

able to generalize the basic mechanisms of building

semantic representations from English corpora to other

languages. Some recent work has applied character-

level CNNs to learn the rich morphological structure of

languages like Arabic, French, and Russian (Kim,

Jernite, Sontag, & Rush, 2015; also see Botha &

Blunsom 2014; Luong, Socher, & Manning 2013).

These approaches clearly suggest that pure word-level

models that have occupied centerstage in the English

language modeling community may not work as well

in other languages, and subword information may in

fact be critical in the language learning process. More

recent embeddings like fastText (Bojanowski et al.,

2017) that are trained on sub-lexical units are a

promising step in this direction. Furthermore,

constructing multilingual word embeddings that can

represent words from multiple languages in a single

distributional space is currently a thriving area of

research in the machine learning community (e.g.,

Chen & Cardie, 2018; Conneau, Lample, Ranzato,

Denoyer, & Jegou, 2018). Overall, evaluating modern

machine learning models on other languages can

provide important insights about language learning and

is therefore critical to the success of the language

modeling enterprise.

6.5 Revisiting Benchmarks for Semantic

Models

A cri tic al issue that has no t rec eived ad equate a ttenti on

in the semantic modeling field is the quality and nature

of benchmark test datasets that are often considered the

final word for comparing state-of-the-art machine

learning-based language models. The General

Language Understanding Evaluation (GLUE; Wang et

al., 2018) benchmark was recently proposed as a

collection of language-based task datasets, including

the Corpus of Linguistic Acceptability (CoLA;

Warstadt et al ., 2 01 8) , t he S ta nf or d S en timent Tr ee ba nk

(Socher et al., 2013), and the Winograd Schema

Challenge (Levesque, Davis, & Morgenstern, 2011),

among a total of eleven language tasks. Other popular

benchmarks in the field include decaNLP (McCann,

Keskar, Xiong, & Socher, 2018), the Stanford Question

Answering Dataset (SQuAD; Rajpurkar et al., 2018),

Word S im il ar it y Te st C ol le ct io n ( Wor dS im -33;

Finkelstein et al., 2002) among others. While these

benchmarks offer a standardized method of comparing

performance across models, several of the tasks

included within these benchmark datasets either consist

of crowdsourced information collected from an

unknown number of participants (e.g., SQuAD), scores

or annotations based on very few human participants

(e.g., 16 participants assessed similarity for 200 word-

pairs in the WordSim-33 dataset), or sometimes

datasets with no established human benchmark (e.g.,

the GLUE Diagnostic dataset, Wang et al., 2018). This

is in contrast to more psychologically motivated

models (e.g., semantic network models, BEAGLE,

Tem po ral C ontext M odel, e tc .), w here m odel

performance is often compared against human

baselines, for example in predicting accuracy or

response latencies to perform a particular task, or

through large-scale normed databases of human

performance in semantic tasks (e.g., English Lexicon

Project; Balota et al., 2007; Semantic Priming Project;

Hutchison et al., 2013). Therefore, to evaluate whether

state-of-the-art machine learning models like ELMo,

BERT and GPT-2 are indeed plausible psychological

models of semantic memory, it is important to not only

establish human baselines for benchmark tasks in the

machine learning community, but also explicitly

compare model performance to human baselines in

both accuracy and response times. There have been

some recent efforts in this direction. For example,

Bender (2015) tested over 400 Amazon Mechanical

Turk users on the Winograd Schema Challenge (a task

that requires the use of world knowledge,

commonsense reasoning and anaphora resolution) and

provided quantitative baselines for accuracy and

response times that should provide useful benchmarks

to compare machine learning models in the extent to

which they explain human behavior (also see

Morgenstern, Davis, & Ortiz, 2016). Further, Chen,

Peterson, and Griffiths (2017) compared the

performance of the word2vec model against human

baselines of solving analogies using relational

similarity judgments to show that word2vec

successfully captures only a subset of analogy

relations. Additionally, Lazaridou, Marelli, and Baroni

(2017) recently compared the performance of their

multimodal skip-gram model (Lazaridou et al., 2015)

against human relatedness judgments to visual and

word cues for newly learned concepts to show that the

model performed very similar to human participants.

Despite these promising studies, such efforts remain

limited due to the goals of machine learning often being

application-focused and the goals of psychology being

explanation-focused. Explicitly comparing model

performance to behavioral task performance represents

an important next step towards reconciling these two

fields, and also combining representational and

process-based accounts of how semantic memory

guides cognitive behavior.

6.6 Prioritizing Mechanistic Accounts

Despite the lack of systematic comparisons to human

baselines, an important takeaway that emerges from

this review is that several state-of-the-art language

models such as word2vec (Mikolov et al., 2013),

ELMo (Peters et al., 2018), BERT (Devlin et al., 2018),

GPT-2 (Radford et al., 2019), and GPT-3 (Brown et al.,

2020) do indeed show impressive performance across

a wide variety of semantic tasks such as

summarization, question answering, and sentiment

analysis. However, despite their success, relatively

little is known about how these models are able to

produce this complex behavior, and exactly what is

being learned by them in their process of building

semantic representations. Indeed, there is some

skepticism in the field about whether these models are

truly learning something meaningful or simply

exploiting spurious statistical cues in language, which

may or may not reflect human learning. For example,

Niven and Kao (2019) recently evaluated BERT’s

performance in a complex argument reasoning

comprehension task, where world knowledge was

critical for evaluating a particular claim. For example,

to evaluate the strength of the claim “Google is not a

harmful monopoly”, an individual may reason that

“people can choose not to use Google”, and also

provide the additional warrant that “other search

engines do not redirect to Google” to argue in favor of

the claim. On the other hand, if the alternative, “all

other search engines redirect to Google” is true, then

the claim would be false. Niven and Kao found that

BERT was able to achieve state-of-the-art performance

with 77% accuracy in this task, without any explicit

world knowledge. For example, knowing what a

monopoly might mean in this context (i.e., restricting

consumer choices) and that Google is a search engine

are critical pieces of knowledge required to evaluate

the claim. Further analysis showed that BERT was

simply exploiting statistical cues in the warrant (i.e.,

the word “not”) to evaluate the claim, and once this cue

was removed through an adversarial test dataset,

BERT’s performance dropped to chance levels (53%).

The authors concluded that BERT was not able to learn

anything meaningful about argument comprehension,

even though the model performed better than other

LSTM and vector-based models and was only a few

points below the human baseline on the original task

(also see Zellers et al., 2019 for a similar demonstration

on a commonsense-based inference task).

These results are especially important if state-of-

the-art models like word2vec, ELMo, BERT or GPT-

2/3 are to be considered plausible models of semantic

memory in any manner and certainly underscore the

need to focus on mechanistic accounts of model

behavior. Understanding how machine learning models

arrive at answers to complex semantic problems is as

important as simply evaluating how many questions

the model was able to answer. Humans not only extract

complex statistical regularities from natural language

and the environment, but also form semantic structures

of world knowledge that influence their behavior in

tasks like complex inference and argument reasoning.

Therefore, explicitly testing machine learning models

on the specific knowledge they have acquired will

become extremely important in ensuring that the

models are truly learning meaning and not simply

exhibiting the “Clever Hans” effect (Heinzerling,

2019). To that end, explicit process-based accounts that

shed light on the cognitive processes operating on

underlying semantic representations across different

semantic tasks may be useful in evaluating the

psychological plausibility of different models. For

instance, while distributional models perform well on a

broad range of semantic tasks on average (Bullinaria &

Levy, 2012; Mandera et al., 2017), it is unclear why

their performance is better on tasks like synonym

detection (Bullinaria & Levy, 2012) and similarity

judgments (Bruni et al., 2014) and worse for semantic

priming effects (Hutchison, Balota, Cortese, & Watson,

2008; Mandera et al., 2017), free association (Griffiths,

Steyvers, & Tenenbaum, 2007; Kennett et al., 2017),

and complex inference tasks (Niven & Kao, 2019). A

promising step towards understanding how

distributional models may dynamically influence task

performance was taken by Rotaru, Vigliocco, and

Frank (2018), who recently showed that combining

semantic network-based representations derived from

LSA, GloVe, and word2vec with a dynamic spreading-

activation framework significantly improved the

predictive power of the models on semantic tasks. In

light of this work, testing competing process-based

models (e.g., spreading activation, drift-diffusion,

temporal context etc.) and structural or representational

accounts of semantic memory (e.g., prediction-based,

topic models, etc.) represents the next step in fully

understanding how structure and processes interact to

produce complex behavior.

7 Conclusion

The nature of knowledge representation and the

processes used to retrieve that knowledge in response

to a given task will continue to be the center of

considerable theoretical and empirical work across

multiple fields including philosophy, linguistics,

psychology, computer science, and cognitive

neuroscience. The ultimate goal of semantic modeling

is to propose one architecture that can simultaneously

integrate perceptual and linguistic input to form

meaningful semantic representations, which in turn

naturally scales up to higher-order semantic structures,

and also performs well in a wide range of cognitive

tasks. Given the recent advances in developing

multimodal DSMs, interpretable and generative topic

models, and attention-based semantic models, this goal

at least appears to be achievable. However, some

important challenges still need to be addressed before

the field will be able to integrate these approaches and

design a unified architecture. For example, addressing

challenges like one-shot learning, language-related

errors and deficits, the role of social interactions, and

the lack of process-based accounts will be important in

furthering research in the field. Although the current

modeling enterprise has come very far in decoding the

statistical regularities humans use to learn meaning

from the linguistic and perceptual environment, no

single model has been successfully able to account for

the flexible and innumerable ways in which humans

acquire and retrieve knowledge. Ultimately, integrating

lessons learned from behavioral studies, such as the

interaction of world knowledge, linguistic and

environmental context, and attention in complex

cognitive tasks with computational techniques that

focus on quantifying association, abstraction, and

prediction will be critical in developing a complete

theory of language.

8 References

Abbott, J. T., Austerweil, J. L., & Griffiths, T. L.

(2015, July). Random walks on semantic networks

can resemble optimal foraging. In Neural

Information Processing Systems Conference. 22(3).

558. American Psychological Association.

Abrams, L., & Farrell, M. T. (2011). Language

processing in normal aging. The Handbook of

Psycholinguistic and Cognitive processes:

Perspectives in Communication Sisorders, 49-73.

Alammar, J. (2018). The Illustrated Transformer.

Retrieved from

https://jalammar.github.io/illustrated-transformer/.

Anderson, J. R. (2000). Learning and Memory: An

Integrated Approach. John Wiley & Sons Inc.

Andrews, M., & Vigliocco, G. (2010). The hidden

Markov topic model: A probabilistic model of

semantic representation. Top ic s i n Co gn it iv e

Science, 2(1), 101-113.

Andrews, M., Vigliocco, G., & Vinson, D. (2009).

Integrating experiential and distributional data to

learn semantic representations. Psychological

Review, 116(3), 463.

Albert, R., Jeong, H., & Barabási, A. L. (2000). Error

and attack tolerance of complex

networks. Nature, 406(6794), 378.

Ashby, F. G., & Maddox, W. T. (1993). Relations

between prototype, exemplar, and decision bound

models of categorization. Journal of Mathematical

Psychology, 37(3), 372-400.

Asr, F. T., Willits, J., & Jones, M. (2016). Comparing

Predictive and Co-occurrence Based Models of

Lexical Semantics Trained on Child-directed

Speech. Proceedings of the Annual Meeting of the

Cognitive Science Society.

Aver y, J. , Jones , M. N. (20 18 ). Comp ar in g models of

semantic fluency: Do humans forage optimally, or

walk randomly? In Proceedings of the 40th Annual

Meeting of the Cognitive Science Society. 118-123.

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural

machine translation by jointly learning to align and

translate. arXiv preprint arXiv:1409.0473.

Bahdanau, D., Hill, F., Leike, J., Hughes, E., Hosseini,

A., Kohli, P., & Grefenstette, E. (2018). Learning to

understand goal specifications by modelling

reward. arXiv preprint arXiv:1806.01946.

Balota, D. A., & Coane, J. H. (2008). Semantic

memory. In Byrne JH, Eichenbaum H, Mwenzel R,

Roediger III HL, Sweatt D (Eds.). Learning and

Memory: A Comprehensive Reference (pp. 511–34).

Amsterdam: Elsevier.

Balota, D. A., & Lorch, R. F. (1986). Depth of

automatic spreading activation: Mediated priming

effects in pronunciation but not in lexical

decision. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 12(3), 336.

Balota, D. A., & Paul, S. T. (1996). Summation of

activation: Evidence from multiple primes that

converge and diverge within semantic memory.

Journal of Experimental Psychology: Learning,

Memory & Cognition, 22, 827–845.

Balota, D. A., & Yap, M. J. (2006). Attentional

control and flexible lexical processing: Explorations

of the magic moment of word recognition. In S.

Andrews (Ed.), From inkmarks to ideas: Current

issues in Lexical processing, 229-258.

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese,

M. J., Kessler, B., Loftis, B., ... & Treiman, R.

(2007). The English lexicon project. Behavior

Research Methods, 39(3), 445-459.

Bannard, C., Baldwin, T., & Lascarides, A. (2003). A

statistical approach to the semantics of verb-

particles. In Proceedings of the ACL 2003 workshop

on Multiword expressions: analysis, acquisition and

treatment (pp. 65-72).

Barabási, A. L., & Albert, R. (1999). Emergence of

scaling in random networks. Science, 286(5439),

509-512.

Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don't

count, predict! A systematic comparison of context-

counting vs. context-predicting semantic vectors.

In Proceedings of the 52nd Annual Meeting of the

Association for Computational Linguistics (Volume

1: Long Papers) (Vol. 1, pp. 238-247).

Baroni, M., & Lenci, A. (2008). Concepts and

properties in word spaces. Italian Journal of

Linguistics, 20(1), 55-88.

Baroni, M., Murphy, B., Barbu, E., & Poesio, M.

(2010). Strudel: A corpus‐based semantic model

based on properties and types. Cognitive

Science, 34(2), 222-254.

Barros-Loscertales, A., González, J., Pulvermüller, F.,

Ven t ur a -Campos, N., Bustamante, J. C., Costumero,

V. , . . . & Á v i l a , C . ( 2 0 1 1 ) . R e a d i n g s a l t a c t i v a t e s

gustatory brain regions: fMRI evidence for

semantic grounding in a novel sensory

modality. Cerebral Cortex, 22(11), 2554-2563.

Barsalou, L. W. (1999). Perceptual symbol

systems. Behavioral and Brain sciences, 22(4),

577-660.

Barsalou, L. W. (2003). Abstraction in perceptual

symbol systems. Philosophical Transactions of the

Royal Society of London. Series B: Biological

Sciences, 358(1435), 1177-1187.

Barsalou, L. W. (2008). Grounded cognition. Annu.

Rev. Psychol., 59, 617-645.

Barsalou, L. W. (2016). On staying grounded and

avoiding quixotic dead ends. Psychonomic Bulletin

& Review, 23(4), 1122-1142.

Barsalou, L. W., Santos, A., Simmons, W. K., &

Wilson, C. D. (2008). Language and simulation in

conceptual processing. Symbols, Embodiment, and

Meaning, 245-283.

Barsalou, L. W., & Wiemer-Hastings, K. (2005).

Situating abstract concepts. Grounding Cognition:

The role of Perception and Action in Memory,

Language, and Thought, 129-163.

Beaty, R. E., Kaufman, S. B., Benedek, M., Jung, R.

E., Kenett, Y. N., Jauk, E., ... & Silvia, P. J. (2016).

Personality and complex brain networks: The role

of openness to experience in default network

efficiency. Human Brain Mapping, 37(2), 773-779.

Bender, D. (2015). Establishing a Human Baseline for

the Winograd Schema Challenge. In MAICS (pp.

39-45). Retrieved from

https://pdfs.semanticscholar.org/1346/3717354ab61

348a0141ebd3b0fdf28e91af8.pdf.

Bengio, Y., Goodfellow, I. J., & Courville, A. (2015).

Deep learning, book in preparation for MIT press

(2015).

Binder, K. S. (2003). Sentential and discourse topic

effects on lexical ambiguity processing: An eye

movement examination. Memory &

Cognition, 31(5), 690-702.

Binder, K. S., & Rayner, K. (1998). Contextual

strength does not modulate the subordinate bias

effect: Evidence from eye fixations and self-paced

reading. Psychonomic Bulletin & Review, 5(2), 271-

276.

Blei, D. M. (2012). Probabilistic topic

models. Communications of the ACM, 55(4), 77-84.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent

dirichlet allocation. Journal of Machine Learning

Research, 3, 993-1022.

Block, C. K., & Baldwin, C. L. (2010). Cloze

probability and completion norms for 498

sentences: Behavioral and neural validation using

event-related potentials. Behavior Research

Methods, 42(3), 665-670.

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T.

(2017). Enriching word vectors with subword

information. Trans actions of the As socia tion for

Computational Linguistics, 5, 135-146.

Borghesani, V., & Piazza, M. (2017). The neuro-

cognitive representations of symbols: The case of

concrete words. Neuropsychologia, 105, 4-17.

Botha, J., & Blunsom, P. (2014, January).

Compositional morphology for word

representations and language modelling.

In International Conference on Machine

Learning (pp. 1899-1907). Retrieved from

http://proceedings.mlr.press/v32/botha14.pdf.

Bower, G. H., Black, J. B., & Turner, T. J. (1979).

Scripts in memory for text. Cognitive

Psychology, 11(2), 177-220.

Bransford, J. D., & Johnson, M. K. (1972). Contextual

prerequisites for understanding: Some

investigations of comprehension and recall. Journal

of Verbal Learning and Verbal Behavior, 11(6),

717-726.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M.,

Kaplan, J., Dhariwal, P., ... & Agarwal, S. (2020).

Language models are few-shot learners. arXiv

preprint arXiv:2005.14165. Retrieved from

https://arxiv.org/pdf/2005.14165.pdf.

Bruni, E., Tran, N. K., & Baroni, M. (2014).

Multimodal distributional semantics. Journal of

Artificial Intelligence Research, 49, 1-47.

Bub, D. N., & Masson, M. E. (2012). On the

dynamics of action representations evoked by

names of manipulable objects. Journal of

Experimental Psychology: General, 141(3), 502.

Buchanan, E. M., Valentine, K. D., & Maxwell, N. P.

(2019). English semantic feature production norms:

An extended database of 4436 concepts. Behavior

Research Methods, 1-15

Bullinaria, J. A., & Levy, J. P. (2007). Extracting

semantic representations from word co-occurrence

statistics: A computational study. Behavior

Research Methods, 39(3), 510-526.

Burgess, C. (2000). Theory and operational

definitions in computational memory models: A

response to Glenberg and Robertson. Journal of

Memory and Language, 43(3), 402-408.

Burgess, C. (2001). Representing and resolving

semantic ambiguity: A contribution from high-

dimensional memory modeling. On the

consequences of meaning selection: Perspectives on

resolving lexical ambiguity, 233-260.

Carey, S., & Bartlett, E. (1978). Acquiring a single

new word. Papers and Reports on Child Language

Development, 15, 17–29

Chen, X., Cardie, C. (2018). Unsupervised

Multilingual Word Embeddings. In Proceedings of

the 2018 Conference on Empirical Methods in

Natural Language Processing (EMNLP 2018).

Retrieved from https://arxiv.org/abs/1808.08933.

Chen, D., Peterson, J. C., & Griffiths, T. L. (2017).

Evaluating vector-space models of analogy. In

Proceedings of the 39th Annual Conference of the

Cognitive Science Society. Retrieved from

https://arxiv.org/abs/1705.04416.

Cho, K., Van Merriënboer, B., Gulcehre, C.,

Bahdanau, D., Bougares, F., Schwenk, H., &

Bengio, Y. (2014). Learning phrase representations

using RNN encoder-decoder for statistical machine

translation. arXiv preprint arXiv:1406.1078.

Retrieved from https://arxiv.org/abs/1406.1078.

Chwilla, D. J., & Kolk, H. H. (2002). Three-step

priming in lexical decision. Memory &

Cognition, 30(2), 217-225.

Clark, S., Coecke, B., & Sadrzadeh, M. (2008). A

compositional distributional model of meaning.

In Proceedings of the Second Quantum Interaction

Symposium (QI-2008) (pp. 133-140).

Clark, K., Khandelwal, U., Levy, O., & Manning, C.

D. (2019). What Does BERT Look At? An Analysis

of BERT's Attention. arXiv preprint

arXiv:1906.04341. Retrieved from

https://arxiv.org/abs/1906.04341.

Clark, S., & Pulman, S. (2007). Combining symbolic

and distributional models of meaning. Retrieved

from

https://www.aaai.org/Papers/Symposia/Spring/2007

/SS-07-08/SS07-08-008.pdf.

Clay, M. M. (1968). A syntactic analysis of reading

errors. Journal of Verbal Learning and Verbal

Behavior, 7(2), 434-438.

Collell, G., & Moens, M. F. (2016). Is an image worth

more than a thousand words? on the fine-grain

semantic differences between visual and linguistic

representations. In Proceedings of the 26th

International Conference on Computational

Linguistics (pp. 2807-2817). ACL.

Collins, A. M., & Loftus, E. F. (1975). A spreading-

activation theory of semantic

processing. Psychological Review, 82(6), 407.

Collins, A. M., & Quillian, M. R. (1969). Retrieval

time from semantic memory. Journal of Verbal

Learning and Verbal Behavior, 8(2), 240-247.

Collobert, R., & Weston, J. (2008). A unified

architecture for natural language processing: Deep

neural networks with multitask learning. In

Proceedings of the 25th international conference on

Machine learning,160-167. ACM.

Collobert, R., Weston, J., Bottou, L., Karlen, M.,

Kavukcuoglu, K., & Kuksa, P. (2011). Natural

language processing (almost) from scratch. The

Journal of Machine Learning Research, 12, 2493-

2537.

Corballis, M. C. (2017a). Language evolution: a

changing perspective. Trends in Cogniti ve

Sciences, 21(4), 229-236.

Corballis, M. C. (2017b). The evolution of language.

In J. Call, G. M. Burghardt, I. M. Pepperberg, C. T.

Snowdon, & T. Zentall (Eds.), APA handbooks in

psychology®. APA handbook of comparative

psychology: Basic concepts, methods, neural

substrate, and behavior (p. 273–297). American

Psychological

Association. https://doi.org/10.1037/0000011-014

Damasio, A. R. (1989). Time-locked multiregional

retroactivation: A systems-level proposal for the

neural substrates of recall and

recognition. Cognition, 33(1-2), 25-62.

De Deyne, S., Kenett, Y. N., Anaki, D., Faust, M., &

Navarro, D. J. (2016). Large-scale network

representations of semantics in the mental

lexicon. Big Data in Cognitive Science: From

Methods to Insights, 174-202.

De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert,

M., & Storms, G. (2019). The “Small World of

Words” Engl is h wo rd a ss oc ia ti on n or ms f or over

12,000 cue words. Behavior Research

Methods, 51(3), 987-1006.

De Deyne, S., Perfors, A., & Navarro, D. J. (2016,

December). Predicting human similarity judgments

with distributional models: The value of word

associations. In Proceedings of COLING 2016, the

26th International Conference on Computational

Linguistics: Technical Papers (pp. 1861-1870).

De Deyne, S., & Storms, G. (2008). Word

associations: Network and semantic properties.

Behavior Research Methods, 40(1), 213-231.

Deese, J. (1959). On the prediction of occurrence of

particular verbal intrusions in immediate

recall. Journal of Experimental Psychology, 58(1),

17.

Dell, G. S. (1990). Effects of frequency and

vocabulary type on phonological speech

errors. Language and Cognitive Processes, 5(4),

313-349.

Denhière, G., Lemaire, B., Bellissens, C., & Jhean-

Larose, S. (2007). A semantic space for modeling

children's semantic memory. Handbook of Latent

Semantic Analysis, 143-165.

Dennis, S. (2004). An unsupervised method for the

extraction of propositional information from text.

Proceedings of the National Academy of Sciences of

the United States of America, 101(Suppl 1), 5206-

5213.

Dennis, S. (2005). A memory‐based theory of verbal

cognition. Cognitive Science, 29(2), 145-193.

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.

(2019). BERT: Pre-training of deep bidirectional

transformers for language understanding. arXiv

preprint arXiv:1810.04805.

DeVore, I., & Tooby, J. (1987). The reconstruction of

hominid behavioral evolution through strategic

modeling. The Evolution of Human Behavior:

Primate Models, edited by WG Kinzey, 183-237.

Dove, G. (2011). On the need for embodied and dis-

embodied cognition. Frontiers in Psychology, 1,

242.

Dubossarsky, H., De Deyne, S., & Hills, T. T. (2017).

Quantifying the structure of free association

networks across the life span. Developmental

Psychology, 53(8), 1560.

Duffy, S. A., Morris, R. K., & Rayner, K. (1988).

Lexical ambiguity and fixation times in

reading. Journal of Memory and Language, 27(4),

429-446.

Durda, K., Buchanan, L., & Caron, R. (2009).

Grounding co-occurrence: Identifying features in a

lexical co-occurrence model of semantic

memory. Behavior Research Methods, 41(4), 1210-

1223.

Elman, J. L. (1990). Finding structure in

time. Cognitive Science, 14(2), 179-211.

Elman, J. L. (1991). Distributed representations,

simple recurrent networks, and grammatical

structure. Machine Learning, 7(2-3), 195-225.

Elman, J. L., & McRae, K. (2019). A model of event

knowledge. Psychological Review, 126(2), 252.

Farah, M. J., & McClelland, J. L. (2013). A

computational model of semantic memory

impairment: Modality specificity and emergent

category specificity (Journal of Experimental

Psychology: General, 120 (4), 339–357).

In Exploring Cognition: Damaged Brains and

Neural Networks (pp. 79-110). Psychology Press.

Federmeier, K. D., & Kutas, M. (1999). A rose by any

other name: Long-term memory structure and

sentence processing. Journal of Memory and

Language, 41(4), 469-495.

Fellbaum, C. (Ed.). (1998). Wo rd Ne t , a n el e ct ro ni c

lexical database. Cambridge, MA: MIT Press.

Feng, Y., & Lapata, M. (2010, June). Visual

information in semantic representation. In Human

Language Technologies: The 2010 Annual

Conference of the North American Chapter of the

Association for Computational Linguistics (pp. 91-

99). Association for Computational Linguistics.

Fernandino, L., Conant, L. L., Binder, J. R.,

Blindauer, K., Hiner, B., Spangler, K., & Desai, R.

H. (2013). Where is the action? Action sentence

processing in Parkinson's

disease. Neuropsychologia, 51(8), 1510-1517.

Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E.,

Solan, Z., Wolfman, G., & Ruppin, E. (2002).

Placing search in context: The concept

revisited. ACM Transactions on information

systems, 20(1), 116-131.

Firth, J. R. (1957). A synopsis of linguistic theory,

1930-1955. In Philological Society (Great Britain)

(Ed.), Studies in Linguistic Analysis. Oxford:

Blackwell.

Fischler, I. (1977). Semantic facilitation without

association in a lexical decision task. Memory &

Cognition, 5, 335-339.

Franklin, N., Norman, K. A., Ranganath, C., Zacks, J.

M., & Gershman, S. J. (2019). Structured event

memory: a neuro-symbolic model of event

cognition. BioRxiv, 541607. Retrieved from

https://www.biorxiv.org/content/biorxiv/early/2019/

02/05/541607.full.pdf.

Fried, E. I., van Borkulo, C. D., Cramer, A. O. J.,

Boschloo, L., Schoevers, R. A., & Borsboom, D

(2017). Mental disorders as networks of problems: a

review of recent insights. Social Psychiatry and

Psychiatric Epidemiology, 52(1), pp. 1–10.

Gabrieli, J. D., Cohen, N. J., & Corkin, S. (1988). The

impaired learning of semantic knowledge following

bilateral medial temporal-lobe resection. Brain and

Cognition, 7(2), 157-177.

Garagnani, M., & Pulvermüller, F. (2016). Conceptual

grounding of language in action and perception: a

neurocomputational model of the emergence of

category specificity and semantic hubs. European

Journal of Neuroscience, 43(6), 721-737.

Glenberg, A. M., & Robertson, D. A. (2000). Symbol

grounding and meaning: A comparison of high-

dimensional and embodied theories of

meaning. Journal of Memory and Language, 43(3),

379-401.

Goldberg, R. F., Perfetti, C. A., & Schneider, W.

(2006). Perceptual knowledge retrieval activates

sensory brain regions. Journal of

Neuroscience, 26(18), 4917-4921.

Griffiths, T. L., & Steyvers, M. (2002). A probabilistic

approach to semantic representation. In Proceedings

of the Annual meeting of the Cognitive Science

Society, 24(24).

Griffiths, T. L., & Steyvers, M. (2003). Prediction and

semantic association. In Advances in Neural

Information Processing Systems, 11-18.

Griffiths, T. L., & Steyvers, M. (2004). Finding

scientific topics. Proceedings of the National

Academy of Sciences, 101(1), 5228-5235.

Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B.

(2007). Topics in semantic representation.

Psychological Review, 114(2), 211.

Gruenenfelder, T. M., Recchia, G., Rubin, T., & Jones,

M. N. (2016). Graph‐theoretic properties of

networks based on word association norms:

implications for models of lexical semantic

memory. Cognitive Science, 40(6), 1460-1495.

Guida, A., & Lenci, A. (2007). Semantic properties of

word associations to Italian verbs. Italian Journal of

Linguistics, 19(2), 293-326.

Günther, F., Rinaldi, L., & Marelli, M. (2019). Vector-

space models of semantic representation from a

cognitive perspective: A discussion of common

misconceptions. Perspectives on Psychological

Science, 14(6), 1006-1033.

Hard, B. M., Tversky, B., & Lang, D. S. (2006).

Making sense of abstract events: Building event

schemas. Memory & Cognition, 34(6), 1221-1235.

Harnad, S. (1990). The symbol grounding

problem. Physica D: Nonlinear Phenomena, 42(1-

3), 335-346.

Harris, Z. (1970). Distributional structure. In Papers

in Structural and Transformational Linguistics (pp.

775-794). Dordrecht, Holland: D. Reidel Publishing

Company.

Hebb, D. (1949). The organization of learning.

Cambridge, MA: MIT Press.

Heinzerling, B. (2019). NLP's Clever Hans Moment

has Arrived. Retrieved from

https://thegradient.pub/nlps-clever-hans-moment-

has-arrived/.

Hermann, K. M., Hill, F., Green, S., Wang, F.,

Faulkner, R., Soyer, H., ... & Wainwright, M.

(2017). Grounded language learning in a simulated

3d world. arXiv preprint arXiv:1706.06551.

Retrieved from https://arxiv.org/abs/1706.06551.

Hills, T. T. (2006). Animal foraging and the evolution

of goal‐directed cognition. Cognitive

Science, 30(1), 3-41.

Hills, T. T., Jones, M. N., & Todd, P. M. (2012).

Optimal foraging in semantic memory.

Psychological Review, 119(2), 431.

Hinton, G. E., & Shallice, T. (1991). Lesioning an

attractor network: Investigations of acquired

dyslexia. Psychological Review, 98(1), 74.

Hintzman, D. L. (1988). Judgments of frequency and

recognition memory in a multiple-trace memory

model. Psychological Review, 95(4), 528.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-

term memory. Neural Computation, 9(8), 1735-

1780.

Hodges, J. R., & Patterson, K. (2007). Semantic

dementia: a unique clinicopathological

syndrome. The Lancet Neurology, 6(11), 1004-

1014.

Hoffman, P., McClelland, J. L., & Lambon Ralph, M.

A. (2018). Concepts, control, and context: A

connectionist account of normal and disordered

semantic cognition. Psychological Review, 125(3),

293.

Howard, M. W., & Kahana, M. J. (2002). A distributed

representation of temporal context. Journal of

Mathematical Psychology, 46, 269-299.

Howard, M. W., Shankar, K. H., & Jagadisan, U. K.

(2011). Constructing semantic representations from

a gradually changing representation of temporal

context. Top ic s in Cogni ti ve Sc ien ce , 3(1), 48-73.

Howell, S. R., Jankowicz, D., & Becker, S. (2005). A

model of grounded language acquisition:

Sensorimotor features improve lexical and

grammatical learning. Journal of Memory and

Language, 53(2), 258-276.

Hutchison, K. A. (2003). Is semantic priming due to

association strength or feature overlap? A

microanalytic review. Psychonomic Bulletin &

Review, 10(4), 785-813.

Hutchison, K. A., & Balota, D. A. (2005). Decoupling

semantic and associative information in false

memories: Explorations with semantically

ambiguous and unambiguous critical lures. Journal

of Memory and Language, 52(1), 1-28.

Hutchison, K. A., Balota, D. A., Neely, J. H., Cortese,

M. J., Cohen-Shikora, E. R., Tse, C. S., ... &

Buchanan, E. (2013). The semantic priming

project. Behavior Research Methods, 45(4), 1099-

1114.

James, L. E., & Burke, D. M. (2000). Phonological

priming effects on word retrieval and tip-of-the-

tongue experiences in young and older

adults. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 26(6), 1378.

Jamieson, R. K., Avery, J. E., Johns, B. T., & Jones,

M. N. (2018). An instance theory of semantic

memory. Computational Brain & Behavior, 1(2),

119-136.

Johns, B. T., & Jones, M. N. (2015). Generating

structure from experience: A retrieval-based model

of language processing. Canadian Journal of

Experimental Psychology/Revue canadienne de

psychologie expérimentale, 69(3), 233.

Johns, B. T., Jones, M. N., & Mewhort, D. J. K.

(2019). Using experiential optimization to build

lexical representations. Psychonomic Bulletin &

Review, 26(1), 103-126.

Johns, B. T., Mewhort, D. J., & Jones, M. N. (2019).

The Role of Negative Information in Distributional

Semantic Learning. Cognitive Science, 43(5),

e12730.

Jones, M. N. (2018). When does abstraction occur in

semantic memory: insights from distributional

models. Language, Cognition and Neuroscience, 1-

Jones, M. N., Dye, M., & Johns, B. T. (2017). Context

as an organizing principle of the lexicon.

In Psychology of Learning and Motivation (Vol. 67,

pp. 239-283). Academic Press.

Jones, M. N., Gruenenfelder, T. M., & Recchia, G.

(2018). In defense of spatial models of semantic

representation. New Ideas in Psychology, 50, 54-60.

Jones, M. N., Hills, T. T., & Todd, P. M. (2015).

Hidden processes in structural representations: A

reply to Abbott, Austerweil, and Griffiths

(2015). Psychological Review, 122(3), 570–574.

doi:10.1037/a0039248

Jones, M. N., Kintsch, W., & Mewhort, D. J. (2006).

High-dimensional semantic space accounts of

priming. Journal of Memory and Language, 55(4),

534-552.

Jones, M. N., Willits, J., & Dennis, S. (2015). Models

of semantic memory. Oxford Handbook of

Mathematical and Computational Psychology, 232-

254.

Jones, M. N., & Mewhort, D. J. (2007). Representing

word meaning and order information in a composite

holographic lexicon. Psychological Review, 114(1),

Jones, M., & Recchia, G. (2010). You can’t wear a

coat rack: A binding framework to avoid illusory

feature migrations in perceptually grounded

semantic models. In Proceedings of the Annual

Meeting of the Cognitive Science Society (Vol. 32,

No. 32).

Jones, M. N., Willits, J., Dennis, S., & Jones, M.

(2015). Models of semantic memory. Oxford

Handbook of Mathematical and Computational

Psychology, 232-254.

Kalchbrenner, N., Grefenstette, E., & Blunsom, P.

(2014). A convolutional neural network for

modelling sentences. arXiv preprint

arXiv:1404.2188.

Kalénine, S., Mirman, D., Middleton, E. L., &

Buxbaum, L. J. (2012). Temporal dynamics of

activation of thematic and functional knowledge

during conceptual processing of manipulable

artifacts. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 38(5), 1274.

Kanerva, P. (2009). Hyperdimensional computing: An

introduction to computing in distributed

representations with high-dimensional random

vectors. Cognitive Computation, 1, 139-159.

Kenett, Y. N., Levi, E., Anaki, D., & Faust, M. (2017).

The semantic distance task: Quantifying semantic

distance with semantic network path length. Journal

of Experimental Psychology: Learning, Memory,

and Cognition, 43(9), 1470.

Kiela, D., & Bottou, L. (2014, October). Learning

image embeddings using convolutional neural

networks for improved multi-modal semantics.

In Proceedings of the 2014 Conference on

Empirical Methods in Natural Language

Processing (EMNLP) (pp. 36-45).

Kiela, D., Bulat, L., & Clark, S. (2015). Grounding

semantics in olfactory perception. In Proceedings of

the 53rd Annual Meeting of the Association for

Computational Linguistics and the 7th International

Joint Conference on Natural Language Processing

(Volume 2: Short Papers) (pp. 231–236). Beijing,

China: ACL.

Kiela, D., & Clark, S. (2015). Multi-and cross-modal

semantics beyond vision: Grounding in auditory

perception. In Proceedings of the 2015 Conference

on Empirical Methods in Natural Language

Processing (EMNLP 2015) (pp. 2461–2470).

Lisbon, Portugal: ACL.

Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016,

March). Character-aware neural language models.

In Thirtieth AAAI Conference on Artificial

Intelligence. Retrieved from

https://www.aaai.org/ocs/index.php/AAAI/AAAI16

/paper/viewPaper/12489.

Kim, A., Ruzmaykin, M., Truong, A., & Summerville,

A. (2019, October). Cooperation and Codenames:

Understanding Natural Language Processing via

Codenames. In Proceedings of the AAAI Conference

on Artificial Intelligence and Interactive Digital

Entertainment (Vol. 15, No. 1, pp. 160-166).

Kintsch, W. (2001). Predication. Cognitive

Science, 25(2), 173-202.

Kousta, S. T., Vigliocco, G., Vinson, D. P., Andrews,

M., & Del Campo, E. (2011). The representation of

abstract words: why emotion matters. Journal of

Experimental Psychology: General, 140(1), 14.

Kumar, A. A., Balota, D. A., Habbert, J., Scaltritti, M.,

& Maddox, G. B. (2019). Converging semantic and

phonological information in lexical retrieval and

selection in young and older adults. Journal of

Experimental Psychology: Learning, Memory, and

Cognition. 45(12), 2267–2289.

https://doi.org/10.1037/xlm0000699

Kumar, A.A., Balota, D.A., Steyvers, M. (2019).

Distant Concept Connectivity in Network-Based

and Spatial Word Representations. In Proceedings

of the 41st Annual Meeting of the Cognitive Science

Society, 1348-1354.

Kumar, A.A., Steyvers, M., & Balota, D.A. (in prep.).

Investigating Semantic Memory Retrieval in a

Cooperative Word Game.

Kurach, K., Raichuk, A., Stańczyk, P., Zając, M.,

Bachem, O., Espeholt, L., ... & Gelly, S. (2019).

Google research football: A novel reinforcement

learning environment. arXiv preprint

arXiv:1907.11180. Retrieved from

https://arxiv.org/abs/1907.11180.

Kutas, M., & Hillyard, S. A. (1980). Event-related

brain potentials to semantically inappropriate and

surprisingly large words. Biological

Psychology, 11(2), 99-116.

Kwantes, P. J. (2005). Using context to build

semantics. Psychonomic Bulletin & Review, 12(4),

703-710.

Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B.

(2015). Human-level concept learning through

probabilistic program

induction. Science, 350(6266), 1332-1338.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., &

Gershman, S. J. (2017). Building machines that

learn and think like people. Behavioral and Brain

sciences, 40.

Lakoff, G., & Johnson, M. (1999). Philosophy in the

Flesh (Vol. 4). New York: Basic books.

Lample, G., Conneau, A., Ranzato, M. A., Denoyer,

L., & Jégou, H. (2018). Word translation without

parallel data. In International Conference on

Learning Representations. Retrieved from

https://openreview.net/forum?id=H196sainb.

Landau, B., Smith, L. B., & Jones, S. S. (1988). The

importance of shape in early lexical

learning. Cognitive Development, 3(3), 299-321.

Landauer, T. K. (2001). Single representations of

multiple meanings in latent semantic analysis.

Landauer, T. K., & Dumais, S. T. (1997). A solution to

Plato's problem: The latent semantic analysis theory

of acquisition, induction, and representation of

knowledge. Psychological Review, 104(2), 211.

Landauer, T. K., Laham, D., Rehder, B., & Schreiner,

M. E. (1997). How well can passage meaning be

derived without using word order? A comparison of

Latent Semantic Analysis and humans.

In Proceedings of the 19th annual meeting of the

Cognitive Science Society (pp. 412-417).

Lazaridou, A., Pham, N. T., & Baroni, M. (2015).

Combining language and vision with a multimodal

skip-gram model. arXiv preprint arXiv:1501.02598.

Lazaridou, A., Marelli, M., & Baroni, M. (2017).

Multimodal word meaning induction from minimal

exposure to natural text. Cognitive Science, 41, 677-

705.

Lebois, L. A., Wilson‐Mendenhall, C. D., & Barsalou,

L. W. (2015). Are automatic conceptual cores the

gold standard of semantic processing? The context‐

dependence of spatial meaning in grounded

congruency effects. Cognitive Science, 39(8), 1764-

1801.

Lee, C. L., Middleton, E., Mirman, D., Kalénine, S.,

& Buxbaum, L. J. (2013). Incidental and context-

responsive activation of structure-and function-

based action features during object

identification. Journal of Experimental Psychology:

Human Perception and Performance, 39(1), 257.

Levesque, H., Davis, E., & Morgenstern, L. (2012,

May). The winograd schema challenge.

In Thirteenth International Conference on the

Principles of Knowledge Representation and

Reasoning. Retrieved from

https://www.aaai.org/ocs/index.php/KR/KR12/pape

r/viewPaper/4492.

Levy, O., & Goldberg, Y. (2014). Neural word

embedding as implicit matrix factorization.

In Advances in neural information processing

systems (pp. 2177-2185).

Levy, O., Goldberg, Y., & Dagan, I. (2015).

Improving distributional similarity with lessons

learned from word embeddings. Transacti ons of the

Association for Computational Linguistics, 3, 211-

225.

Li, P., Burgess, C., & Lund, K. (2000). The

acquisition of word meaning through global lexical

co-occurrences. In Proceedings of the Thirtieth

Annual Child Language Research Forum, 166-178.

Li, J., & Jurafsky, D. (2015). Do multi-sense

embeddings improve natural language

understanding?. arXiv preprint arXiv:1506.01070.

Li, J., Monroe, W., Ritter, A., Galley, M., Gao, J., &

Jurafsky, D. (2016). Deep reinforcement learning

for dialogue generation. arXiv preprint

arXiv:1606.01541. Retrieved from

https://arxiv.org/abs/1606.01541.

Li, J., Monroe, W., Shi, T., Jean, S., Ritter, A., &

Jurafsky, D. (2017). Adversarial learning for neural

dialogue generation. arXiv preprint

arXiv:1701.06547. Retrieved from

https://arxiv.org/abs/1701.06547.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen,

D., ... & Stoyanov, V. (2019). Roberta: A robustly

optimized bert pretraining approach. arXiv preprint

arXiv:1907.11692. Retrieved from

https://arxiv.org/abs/1907.11692.

Livesay, K., & Burgess, C. (1998). Mediated priming

in high-dimensional semantic space: No effect of

direct semantic relationships or co-

occurrence. Brain and Cognition, 37(1), 102–105.

Loaiza, V. M., & Camos, V. (2018). The role of

semantic representations in verbal working

memory. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 44(6), 863.

Lopopolo, A., & Miltenburg, E. (2015). Sound-based

distributional models. In Proceedings of the 11th

International Conference on Computational

Semantics (pp. 70–75).

Louwerse, M. M. (2011). Symbol interdependency in

symbolic and embodied cognition. Top i cs i n

Cognitive Science, 3(2), 273-302.

Lucas, M. (2000). Semantic priming without

association: A meta-analytic review. Psychonomic

Bulletin & Review, 7(4), 618-630.

Lucy, L., & Gauthier, J. (2017). Are distributional

representations ready for the real world? Evaluating

word vectors for grounded perceptual

meaning. arXiv preprint arXiv:1705.11168.

Lund, K., & Burgess, C. (1996). Producing high-

dimensional semantic spaces from lexical co-

occurrence. Behavior Research Methods,

Instruments, & Computers, 28(2), 203-208.

Luong, T., Socher, R., & Manning, C. (2013, August).

Better word representations with recursive neural

networks for morphology. In Proceedings of the

Seventeenth Conference on Computational Natural

Language Learning (pp. 104-113). Retrieved from

https://www.aclweb.org/anthology/W13-3512/.

Lupker, S. J. (1984). Semantic priming without

association: A second look. Journal of Verbal

Learning and Verbal Behavior, 23, 709-733.

Mandera, P., Keuleers, E., & Brysbaert, M. (2017).

Explaining human performance in psycholinguistic

tasks with models of semantic similarity based on

prediction and counting: A review and empirical

validation. Journal of Memory and Language, 92,

57-78.

Masson, M. E. (1995). A distributed memory model of

semantic priming. Journal of Experimental

Psychology: Learning, Memory, and

Cognition, 21(1), 3.

Matheson, H. E., & Barsalou, L. W. (2018).

Embodiment and grounding in cognitive

neuroscience. Stevens' Handbook of Experimental

Psychology and Cognitive Neuroscience, 3, 1-27.

Matheson, H., White, N., & McMullen, P. (2015).

Accessing embodied object representations from

vision: A re view. Psychological Bulletin, 141(3),

511.

Mayford, M., Siegelbaum, S. A., & Kandel, E. R.

(2012). Synapses and memory storage. Cold Spring

Harbor Perspectives in Biology, 4(6), a005751.

McCann, B., Keskar, N. S., Xiong, C., & Socher, R.

(2018). The natural language decathlon: Multitask

learning as question answering. arXiv preprint

arXiv:1806.08730. Retrieved from

https://arxiv.org/abs/1806.08730.

McClelland, J. L., Hill, F., Rudolph, M., Baldridge, J.,

& Schütze, H. (2019). Extending Machine

Language Models toward Human-Level Language

Understanding. arXiv preprint arXiv:1912.05877.

Retrieved from https://arxiv.org/abs/1912.05877.

McClelland, J. L., & Thompson, R. M. (2007). Using

domain‐general principles to explain children's

causal reasoning abilities. Developmental

Science, 10(3), 333-356.

McKoon, G., & Ratcliff, R. (1992). Spreading

activation versus compound cue accounts of

priming: Mediated priming revisited. Journal of

Experimental Psychology: Learning, Memory, and

Cognition, 18(6), 1155.

McKoon, G., Ratcliff, R., & Dell, G. S. (1986). A

critical evaluation of the semantic-episodic

distinction. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 12(2), 295–

306. https://doi.org/10.1037/0278-7393.12.2.295

McNamara, T. P., & Altarriba, J. (1988). Depth of

spreading activation revisited: Semantic mediated

priming occurs in lexical decisions. Journal of

Memory and Language, 27(5), 545-559.

McRae, K. (2004). Semantic memory: Some insights

from feature-based connectionist attractor

networks. The Psychology of Learning and

Motivation: Advances in Research and Theory, 45,

41-86.

McRae, K., Cree, G. S., Seidenberg, M. S., &

McNorgan, C. (2005). Semantic feature production

norms for a large set of living and nonliving

things. Behavior Research Methods, 37(4), 547-

559.

McRae, K., De Sa, V. R., & Seidenberg, M. S. (1997).

On the nature and scope of featural representations

of word meaning. Journal of Experimental

Psychology: General, 126(2), 99.

McRae, K., Khalkhali, S., & Hare, M. (2012).

Semantic and associative relations: Examining a

tenuous dichotomy. In V. F. Reyna, S. B. Chapman,

M. R. Dougherty, & J. Confrey (Eds.), The

Adolescent Brain: Learning, Reasoning, and

Decision Making (pp. 39-66). Washington, DC:

APA.

Metusalem, R., Kutas, M., Urbach, T. P., Hare, M.,

McRae, K., & Elman, J. L. (2012). Generalized

event knowledge activation during online sentence

comprehension. Journal of Memory and

Language, 66(4), 545-567.

Meyer, D. E., & Schvaneveldt, R. W. (1971).

Facilitation in recognizing pairs of words: evidence

of a dependence between retrieval

operations. Journal of Experimental

Psychology, 90(2), 227.

Miller, G.A. (1995).WordNet: An online lexical

database [Special Issue]. International Journal of

Lexicography, 3(4).

Mitchell, J., & Lapata, M. (2010). Composition in

distributional models of semantics. Cognitive

Science, 34(8), 1388-1429.

Mikolov, T., Chen, K., Corrado, G., & Dean, J.

(2013a). Efficient estimation of word

representations in vector space. arXiv preprint

arXiv:1301.3781.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., &

Dean, J. (2013b). Distributed representations of

words and phrases and their compositionality.

In Advances in Neural Information Processing

Systems (pp. 3111-3119).

Morgenstern, L., Davis, E., & Ortiz, C. L. (2016).

Planning, executing, and evaluating the winograd

schema challenge. AI Magazine, 37(1), 50-54.

Morris, R. K. (1994). Lexical and message-level

sentence context effects on fixation times in

reading. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 20(1), 92.

Neely, J. H. (1977). Semantic priming and retrieval

from lexical memory: Roles of inhibitionless

spreading activation and limited-capacity

attention. Journal of Experimental Psychology:

General, 106(3), 226.

Neely, J. H. (2012). Semantic priming effects in visual

word recognition: A selective review of current

findings and theories. In Basic processes in

reading (pp. 272-344). Routledge.

Neisser, U. 1976. Cognition and Reality. San

Francisco: W.H. Freeman and Co.

Nelson, D. L., McEvoy, C. L., & Schreiber, T. A.

(2004). The University of South Florida free

association, rhyme, and word fragment

norms. Behavior Research Methods, Instruments, &

Computers, 36(3), 402-407.

Nematzadeh, A., Miscevic, F., & Stevenson, S.

(2016). Simple search algorithms on semantic

networks learned from language use. arXiv preprint

arXiv:1602.03265. Retrieved from

https://arxiv.org/pdf/1602.03265.pdf.

Niven, T., & Kao, H. Y. (2019). Probing neural

network comprehension of natural language

arguments. arXiv preprint arXiv:1907.07355.

Retrieved from

https://arxiv.org/pdf/1907.07355.pdf.

Nosofsky, R. M. (1988). Exemplar-based accounts of

relations between classification, recognition, and

typicality. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 14(4), 700.

Nosofsky, R. M. (1991). Tests of an exemplar model

for relating perceptual classification and recognition

memory. Journal of Experimental Psychology:

Human Perception and Performance, 17, 3–27.

Nosofsky, R. M., & Zaki, S. R. (2003). A hybrid-

similarity exemplar model for predicting

distinctiveness effects in perceptual old-new

recognition. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 29(6), 1194.

Nozari, N., Trueswell, J. C., & Thompson-Schill, S. L.

(2016). The interplay of local attraction, context and

domain-general cognitive control in activation and

suppression of semantic distractors during sentence

comprehension. Psychonomic Bulletin &

Review, 23(6), 1942-1953.

O’Kane, G., Kensinger, E. A., & Corkin, S. (2004).

Evidence for semantic learning in profound

amnesia: an investigation with patient

HM. Hippocampus, 14(4), 417-425.

Olah, C. (2019, May 20). Understanding LSTM

Networks. Colah’s Blog. Retrieved from

https://colah.github.io/posts/2015-08-

Understanding-LSTMs/

Olney, A. M. (2011). Large-scale latent semantic

analysis. Behavior Research Methods, 43(2), 414-

423.

OpenAI (2019). Dota 2 with Large Scale Deep

Reinforcement Learning. Retrieved from

https://arxiv.org/abs/1912.06680.

Osgood, C. E. (1952). The nature and measurement of

meaning. Psychological Bulletin, 49(3), 197.

Osgood, C. E., Suci, G. J., & Tannenbaum, P. H.

(1957). The Measurement of Meaning (No. 47).

University of Illinois Press.

Pacht, J. M., & Rayner, K. (1993). The processing of

homophonic homographs during reading: Evidence

from eye movement studies. Journal of

Psycholinguistic Research, 22(2), 251-271.

Paivio, A. (1991). Dual coding theory: Retrospect and

current status. Canadian Journal of

Psychology/Revue canadienne de

psychologie, 45(3), 255.

Patterson, K., Nestor, P. J., & Rogers, T. T. (20 07 ).

Where do you know what you know? The

representation of semantic knowledge in the human

brain. Nature Reviews Neuroscience, 8(12), 976.

Pennington, J., Socher, R., & Manning, C. (2014).

Glove: Global vectors for word representation.

In Proceedings of the 2014 Conference on

Empirical Methods in Natural Language

Processing (EMNLP) (pp. 1532-1543).

Perfetti, C. (1998). The limits of co-occurrence: Tools

and theories in language research. Discourse

Processes, 25, 363-377.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M.,

Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep

contextualized word representations. arXiv preprint

arXiv:1802.05365.

Pezzulo, G., & Calvi, G. (2011). Computational

explorations of perceptual symbol systems

theory. New Ideas in Psychology, 29(3), 275-297.

Pinker, S. (2003). Language as an adaptation to the

cognitive niche. Studies in the Evolution of

Language, 3, 16-37.

Pirrone, A., Marshall, J. A., & Stafford, T. (2017). A

Drift Diffusion Model account of the semantic

congruity effect in a classification

paradigm. Journal of Numerical Cognition, 3(1),

77-96.

Plaut, D. C., & Booth, J. R. (2000). Individual and

developmental differences in semantic priming:

empirical and computational support for a single-

mechanism account of lexical

processing. Psychological Review, 107(4), 786.

Plaut, D. C., & Shallice, T. (1993). Deep dyslexia: A

case study of connectionist

neuropsychology. Cognitive

Neuropsychology, 10(5), 377-500.

Poirier, M., Saint-Aubin, J., Mair, A., Tehan, G., &

Tol an , A. ( 20 15) . Orde r reca ll in v erba l sho rt -term

memory: The role of semantic networks. Memory &

Cognition, 43(3), 489-499.

Posner, M. I., & Snyder, C. R. R. (1975) Attention and

cognitive control. In: Solso R (ed.) Information

Processing and Cognition: The Loyola Symposium,

pp. 55–85. Hillsdale, NJ: Erlbaum.

Posner, M. I., & Keele, S. W. (1968). On the genesis

of abstract ideas. Journal of Experimental

Psychology, 77(3p1), 353.

Pulvermüller, F. (2005). Brain mechanisms linking

language and action. Nature Reviews

Neuroscience, 6(7), 576.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D.,

& Sutskever, I. (2019). Language models are

unsupervised multitask learners. OpenAI Blog, 1(8).

Retrieved from https://www.techbooky.com/wp-

content/uploads/2019/02/Better-Language-Models-

and-Their-Implications.pdf.

Ratcliff, R., & McKoon, G. (2008). The diffusion

decision model: theory and data for two-choice

decision tasks. Neural Computation, 20(4), 873-

922.

Rayner, K., Cook, A. E., Juhasz, B. J., & Frazier, L.

(2006). Immediate disambiguation of lexically

ambiguous words during reading: Evidence from

eye movements. British Journal of

Psychology, 97(4), 467-482.

Rayner, K., & Frazier, L. (1989). Selection

mechanisms in reading lexically ambiguous

words. Journal of Experimental Psychology:

Learning, Memory, and Cognition, 15(5), 779.

Recchia, G., & Jones, M. N. (2009). More data trumps

smarter algorithms: Comparing pointwise mutual

information with latent semantic analysis. Behavior

Research Methods, 41(3), 647-656.

Recchia, G., & Nulty, P. (2017). Improving a

Fundamental Measure of Lexical Association.

In CogSci.

Reilly, J. (2016). How to constrain and maintain a

lexicon for the treatment of progressive semantic

naming deficits: principles of item selection for

formal semantic therapy. Neuropsychological

Rehabilitation, 26(1), 126-156.

Reisinger, J., & Mooney, R. J. (2010, June). Multi-

prototype vector-space models of word meaning.

In Human Language Technologies: The 2010

Annual Conference of the North American Chapter

of the Association for Computational

Linguistics (pp. 109-117). Association for

Computational Linguistics.

Rescorla, R. A. (1988). Behavioral studies of

Pavlovian conditioning. Annual Review of

Neuroscience, 11(1), 329-352.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of

Pavlovian conditioning: Variations in the

effectiveness of reinforcement and

nonreinforcement. Classical Conditioning II:

Current Research and Theory, 2, 64-99.

Reynolds, J. R., Zacks, J. M., & Braver, T. S. (2007).

A computa tional model of event segmenta tion from

perceptual prediction. Cognitive Science, 31(4),

613-643.

Riordan, B., & Jones, M. N. (2011). Redundancy in

perceptual and linguistic experience: Comparing

feature‐based and distributional models of semantic

representation. Top ic s i n Co gn it iv e S ci e nc e, 3(2),

303-345.

Richie, R., White, B., Bhatia, S., & Hout, M. C.

(2019). The spatial arrangement method of

measuring similarity can capture high-dimensional,

semantic structures. Retrieved from

https://psyarxiv.com/qm27p.

Richie, R., Zou, W., & Bhatia, S. (2019). Predicting

high-level human judgment across diverse

behavioral domains. Collabra: Psychology, 5(1).

Roediger, H. L., & McDermott, K. B. (1995).

Creating false memories: Remembering words not

presented in lists. Journal of Experimental

Psychology: Learning, Memory, and

Cognition, 21(4), 803.

Rogers, T. T., Lambon Ralph, M. A., Garrard, P.,

Bozeat, S., McClelland, J. L., Hodges, J. R., &

Patterson, K. (2004). Structure and deterioration of

semantic memory: a neuropsychological and

computational investigation. Psychological

Review, 111(1), 205.

Rogers, T. T., & Wolmetz, M. (2016). Conceptual

knowledge representation: A cross-section of

current research. Cognitive Neuropsychology, 33(3-

4), 121-129.

Roget, P. M. (1911). Roget’s Thesaurus of English

Words and Ph ra se s (1 911 ed.). Re tr ie ve d Oc to be r

28, 2004, from

http://www.gutenberg.org/etext/10681

Rosch, E., & Lloyd, B. B. (Eds.). (1978). Cognition

and categorization.

Rosch, E., & Mervis, C. B. (1975). Family

resemblances: Studies in the internal structure of

categories. Cognitive Psychology, 7(4), 573-605.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D.

M., & Boyes-Braem, P. (1976). Basic objects in

natural categories. Cognitive Psychology, 8(3), 382-

439.

Rotaru, A. S., Vigliocco, G., & Frank, S. L. (2018).

Modeling the Structure and Dynamics of Semantic

Processing. Cognitive Science, 42(8), 2890-2917.

Rubinstein, D., Levi, E., Schwartz, R., & Rappoport,

A. (2015, July). How well do distributional models

capture different types of semantic knowledge?.

In Proceedings of the 53rd Annual Meeting of the

Association for Computational Linguistics and the

7th International Joint Conference on Natural

Language Processing (Volume 2: Short Papers,

726-730.

Rumelhart, D. E. (1991). Understanding

understanding. Memories, thoughts and emotions:

Essays in honor of George Mandler, 257, 275.

Rumelhart, D. E., Hinton, G. E., & McClelland, J. L.

(1986). A general framework for parallel distributed

processing. Parallel Distributed Processing:

Explorations in the Microstructure of

cognition, 1(45-76), 26.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J.

(1988). Learning representations by back-

propagating errors. Cognitive Modeling, 5(3), 1.

Rumelhart, D. E., & Todd, P. M. (1993). Learning and

connectionist representations. Attention and

Performance XIV: Synergies in Experimental

Psychology, Artificial Intelligence, and Cognitive

Neuroscience, 3-30.

Sahlgren, M. (2008). The distributional

hypothesis. Italian Journal of Disability Studies, 20,

33-53.

Sahlgren, M., Holst, A., & Kanerva, P. (2008).

Permutations as a means to encode order in word

space. Proceedings of the 30th Conference of the

Cognitive Science Society, p. 1300-1305.

Saluja, A., Dyer, C., & Ruvini, J. D. (2018).

Paraphrase-Supervised Models of

Compositionality. arXiv preprint arXiv:1801.10293.

Schank, R. C., & Abelson, R. P. (1977).

Scripts. Plans, Goals and Understanding.

Schapiro, A. C., Rogers, T. T., Cordova, N. I., Turk-

Browne, N. B., & Botvinick, M. M. (2013). Neural

representations of events arise from temporal

community structure. Nature Neuroscience, 16(4),

486.

Schneider, T. R., Debener, S., Oostenveld, R., &

Engel, A. K. (2008). Enhanced EEG gamma-band

activity reflects multisensory semantic matching in

visual-to-auditory object

priming. Neuroimage, 42(3), 1244-1254.

Searle, J. R. (1980). Minds, brains, and

programs. Behavioral and Brain Sciences, 3(3),

417-424.

Shallice, T. (1988). Specialisation within the semantic

system. Cognitive Neuropsychology, 5(1), 133-142.

Shen, J. H., Hofer, M., Felbo, B., & Levy, R. (2018).

Comparing Models of Associative Meaning: An

Empirical Investigation of Reference in Simple

Language Games. arXiv preprint

arXiv:1810.03717.

Siew, C. S., Wulff, D. U., Beckage, N. M., & Kenett,

Y. N . ( 2 0 1 8 ) . C o g n i t i v e N e t w o r k S c i e n c e : A r e v i e w

of research on cognition through the lens of

network representations, processes, and dynamics.

Complexity.

Silberer, C., & Lapata, M. (2012, July). Grounded

models of semantic representation. In Proceedings

of the 2012 Joint Conference on Empirical Methods

in Natural Language Processing and

Computational Natural Language Learning (pp.

1423-1433). Association for Computational

Linguistics.

Silberer, C., & Lapata, M. (2014, June). Learning

grounded meaning representations with

autoencoders. In Proceedings of the 52nd Annual

Meeting of the Association for Computational

Linguistics (Volume 1: Long Papers) (pp. 721-732).

Sivic, J., & Zisserman, A. (2003, October). Video

Google: A text retrieval approach to object

matching in videos. In Proceedings of the Ninth

IEEE International Conference on Computer

Visi o n .

Sloutsky, V. M., Yim, H., Yao, X., & Dennis, S.

(2017). An associative account of the development

of word learning. Cognitive Psychology, 97, 1-30.

Smith, E. E., Shoben, E. J., & Rips, L. J. (1974).

Structure and process in semantic memory: A

featural model for semantic

decisions. Psychological Review, 81(3), 214.

Socher, R., Huval, B., Manning, C. D., & Ng, A. Y.

(2012, July). Semantic compositionality through

recursive matrix-vector spaces. In Proceedings of

the 2012 joint conference on empirical methods in

natural language processing and computational

natural language learning (pp. 1201-1211).

Association for Computational Linguistics.

Socher, R., Perelygin, A., Wu, J., Chuang, J.,

Manning, C. D., Ng, A., & Potts, C. (2013).

Recursive deep models for semantic

compositionality over a sentiment treebank.

In Proceedings of the 2013 Conference on

Empirical methods in Natural Language

Processing (pp. 1631-1642).

Spranger, M., Pauw, S., Loetzsch, M., & Steels, L.

(2012). Open-ended procedural semantics. In L.

Steels & M. Hild (Eds.), Language grounding in

robots (pp. 153–172). Berlin, Heidelberg, Germany:

Springer.

Stanton, R. D., Nosofsky, R. M., & Zaki, S. R. (2002).

Comparisons between exemplar similarity and

mixed prototype models using a linearly separable

category structure. Memory & Cognition, 30(6),

934-944.

Stella, M., Beckage, N. M., & Brede, M. (2017).

Multiplex lexical networks reveal patterns in early

word acquisition in children. Scientific Reports, 7,

46730.

Stella, M., Beckage, N. M., Brede, M., & De

Domenico, M. (2018). Multiplex model of mental

lexicon reveals explosive learning in

humans. Scientific Reports, 8(1), 2259.

Steyvers, M., & Tenenbaum, J. B. (2005). The large‐

scale structure of semantic networks: Statistical

analyses and a model of semantic growth. Cognitive

Science, 29(1), 41-78.

Sutton, R. and Barto, A. (1998). Reinforcement

learning: An introduction. Cambridge, MA, MIT

Press.

Swinney, D. A. (1979). Lexical access during sentence

comprehension:(Re) consideration of context

effects. Journal of Verbal Learning and Verbal

Behavior, 18(6), 645-659.

Tab os si, P., C olom bo, L. , & Jo b, R. ( 1987 ). Acces sing

lexical ambiguity: Effects of context and

dominance. Psychological Research, 49(2-3), 161-

167.

Thompson-Schill, S. L. (2003). Neuroimaging studies

of semantic memory: inferring “how” from

“where”. Neuropsychologia, 41(3), 280-292.

Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D.

E. (1998). Effects of semantic and associative

relatedness on automatic priming. Journal of

Memory and Language, 38, 440-458.

Tulving, E. (1972). Episodic and semantic

memory. Organization of Memory, 1, 381-403.

Turney, P. D., & Pantel, P. (2010). From frequency to

meaning: Vector space models of

semantics. Journal of Artificial Intelligence

Research, 37, 141-188.

Tversky, A. (1977). Features of similarity.

Psychological Review, 84(4), 327.

Tversky, A., & Gati, I. (1982). Similarity, separability,

and the triangle inequality. Psychological

Review, 89(2), 123.

Upadhyay, S., Chang, K. W., Taddy, M., Kalai, A., &

Zou, J. (2017). Beyond bilingual: Multi-sense word

embeddings using multilingual context. arXiv

preprint arXiv:1706.08160.

Va s wa n i , A . , Sh a z e e r, N ., P a r ma r , N. , U s zk o r e i t, J . ,

Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017).

Attention is all you need. In Advances in neural

information processing systems (pp. 5998-6008).

Vig lio cco , G ., Kou sta , S . T., D ell a Rosa, P. A., Vinson,

D. P., Tettamanti, M., Devlin, J. T., & Cappa, S. F.

(2013). The neural representation of abstract words:

the role of emotion. Cerebral Cortex, 24(7), 1767-

1777.

Vig lio cco , G ., Met eya rd, L. , An dre ws, M. , & Ko ust a,

S. (2009). Toward a theory of semantic

representation. Language and Cognition, 1(2), 219-

247.

Vig lio cco , G ., Vins on, D. P., L ewi s, W., & Gar ret t, M.

F. (200 4) . Re pr es en ti ng t he m eanings of o bj ec t an d

action words: The featural and unitary semantic

space hypothesis. Cognitive Psychology, 48(4),

422-488.

Vit evi tch , M . S ., Cha n, K. Y., & G old ste in, R. (2 014 ).

Insights into failed lexical retrieval from network

science. Cognitive Psychology, 68, 1-32.

Wang, S. I ., L ia ng , P., & M an ni ng , C. D . (2 01 6) .

Learning language games through interaction. arXiv

preprint arXiv:1606.02447.

Wang, A ., S in gh , A., M ic ha el , J. , Hi ll , F., Levy, O ., &

Bowman, S. R. (2018). Glue: A multi-task

benchmark and analysis platform for natural

language understanding. arXiv preprint

arXiv:1804.07461. Retrieved from

https://arxiv.org/abs/1804.07461.

Warstadt , A ., S in gh , A. , & Bo wm an , S. R . (2 01 8) .

Neural network acceptability judgments. arXiv

preprint arXiv:1805.12471. Retrieved from

https://arxiv.org/abs/1805.12471.

Watts, D . J. , & St ro ga tz , S. H . (1998). Coll ec ti ve

dynamics of ‘small-world’ networks. Nature,

393(6684), 440.

Widdows, D. (2008). Semantic Vector Products: Some

Initial Investigations. In Proceedings of the Second

AAAI Symposium on Quantum Interaction.

Retrieved from

https://research.google/pubs/pub33477/.

Willems, R. M., Labruna, L., D’Esposito, M., Ivry, R.,

& Casasanto, D. (2011). A functional role for the

motor system in language understanding: evidence

from theta-burst transcranial magnetic

stimulation. Psychological Science, 22(7), 849-854.

Wittgenstein, Ludwig (1953). Philosophical

Investigations. Blackwell Publishing.

Wulff, D. U., Hills, T., & Mata, R. (2018). Structural

differences in the semantic networks of younger and

older adults. Retrieved from

https://psyarxiv.com/s73dp/.

Xu, F., & Tenenbaum, J. B. (2007). Word learning as

Bayesian inference. Psychological review, 114(2),

245.

Ye e, E. , A h me d , S . Z . , & T h om p so n -Schill, S. L.

(2012). Colorless green ideas (can) prime

furiously. Psychological Science, 23(4), 364-369.

Ye e, E. , Ch r ys i k ou , E . G . , H o ff m an , E . , & T h o mp s on -

Schill, S. L. (2013). Manual experience shapes

object representations. Psychological

Science, 24(6), 909-919.

Ye e, E. , Hu f fs t et l e r, S . , & T h o mp s on -Schill, S. L.

(2011). Function follows form: Activation of shape

and function features during object

identification. Journal of Experimental Psychology:

General, 140(3), 348.

Ye e, E. , Jo n es , M. N. , & M cR a e, K. ( 20 1 8) . S e ma n t ic

memory. Stevens' Handbook of Experimental

Psychology and Cognitive Neuroscience, 3, 1-38.

Ye e, E. , La h ir i , A ., & K ot z o r, S . (2 0 17 ) . F l ui d

semantics: Semantic knowledge is experience-based

and dynamic. The Speech Processing Lexicon:

Neurocognitive and Behavioural Approaches, 22,

236.

Ye e, E. , & Thompson-Schill, S. L. (2016). Putting

concepts into context. Psychonomic Bulletin &

Review, 23(4), 1015-1027.

Ye ss e na l i na , A ., & C a rd i e, C . ( 20 11 , Ju l y) .

Compositional matrix-space models for sentiment

analysis. In Proceedings of the Conference on

Empirical Methods in Natural Language

Processing (pp. 172-182). Association for

Computational Linguistics.

Zacks, J. M., Kurby, C. A., Eisenberg, M. L., &

Haroutunian, N. (2011). Prediction error associated

with the perceptual segmentation of naturalistic

events. Journal of Cognitive Neuroscience, 23 (12),

4057–4066.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., &

Choi, Y. (2019). HellaSwag: Can a Machine Really

Finish Your Sentence?. arXiv preprint

arXiv:1905.07830. Retrieved from

https://arxiv.org/pdf/1905.07830.pdf.

Zhu, X., Sobhani, P., & Guo, H. (2016). Dag-

structured long short-term memory for semantic

compositionality. In Proceedings of the 2016

Conference of the North American Chapter of the

Association for Computational Linguistics: Human

Language Technologies (pp. 917-926).

Changes in semantic memory structure support successful problem-solving and analogical transfer

Article

Full-text available

Jun 2024

Creative problem-solving is central in daily life, yet its underlying mechanisms remain elusive. Restructuring (i.e., reorganization of problem-related representations) is considered one problem-solving mechanism and may lead to an abstract problem-related representation facilitating the solving of analogous problems. Here, we used network science methodology to estimate participants' semantic memory networks (SemNets) before and after attempting to solve a riddle. Restructuring was quantified as the difference in SemNets metrics between pre-and post-solving phases. Our results provide initial evidence that problem-related SemNets restructuring may be associated with the successful solving of the riddle and, subsequently, an analogous one. Solution-relevant concepts and semantically remote concepts became more strongly related in solvers. Only changes in semantically remote concepts were instrumental in actively solving the riddle while changes in solution-relevant concepts may reflect a pre-exposure to the solution. In our daily life, we constantly deal with problems, ranging from the most mundane (e.g., what to cook for dinner given the ingredients at our disposal), to professional activities (e.g., how to reorganize our current plans to meet a new deadline), up to major societal challenges (e.g., how to find innovative solutions against global warming). How do we find new solutions to problems? While the ability to solve problems is a critical skill for adapting to new situations and innovating, the mechanisms underlying the problem-solving process remain largely unknown. Among the new problems we face each day, some are well-defined (e.g., playing a jigsaw puzzle). The initial state (i.e., the number of independent pieces) and goal state (i.e., assembling the pieces so it looks like the picture model) are clear, and the solver can apply a set of operations (i.e., interlocking the pieces as a function of their shape) to reach the goal. However, for many of our problems (e.g., organizing work activities during the COVID-19 pandemic), the problem space is ambiguous. No heuristics or existing rules could be applied to transform the initial state into the goal state 1. Such "ill-defined" problems 2 thus require additional mental processes , which have been tightly linked to creative thinking 3-5. Ill-defined problem-solving (or creative problem-solving) is often referred to as insight solving, where the solution comes to mind suddenly and effortlessly, with a "Eureka" phenomenon 6-9. According to the Representational Change Theory 10 , solving such problems involves restructuring the initial problem mental representational space 5,9 , which presumably entails combining elements related to the problem in a new way. In theory, restructuring allows one to change perspective, reframe the problem, or escape its implicitly imposed constraints 11 , leading to creative associations 6,9. For instance, consider the following problem: "A man walks into a bar and asks for a glass of water. The bartender points a shotgun at the man. The man says, 'Thank you,' and walks out" 12. The problem is ill-defined because the path to finding the solution is to be discovered, and the goal state is vague. Solving this problem first requires asking the right question: in which context would a shotgun and a glass of water help somebody? Rather than relying on obvious associations (e.g., a glass of water is related to thirst), solvers must fill the missing link between the relevant elements of the problem (a shotgun induces fear, and fear can be a remedy for hiccups, as can drinking a glass of water). Hence, restructuring the initial representation of a given problem would allow one to see this link and find its solution. A separate field of research suggests that such reorganization of mental representations could be useful, not only for solving a given problem, but also for solving future, different problems that share some structural

Exploring Semanticity for Content and Function Word Distinction in Catalan

Article

Full-text available

May 2024

In the realm of linguistics, the concept of “semanticity” was recently introduced as a novel measure designed to study linguistic networks. In a given text, semanticity is defined as the ratio of the potential number of meanings associated with a word to the number of different words with which it is linguistically linked. This concept provides a quantitative indicator that reflects a word’s semantic complexity and its role in a language. In this pilot study, we applied the semanticity measure to the Catalan language, aiming to investigate its effectiveness in automatically distinguishing content words from function words. For this purpose, the measure of semanticity has been applied to a large corpus of texts written in Catalan. We show that the semanticity of words allows us to classify the word classes existing in Catalan in a simple way so that both the semantic and syntactic capacity of each word within a language can be integrated under this parameter. By means of this semanticity measure, it has been observed that adverbs behave like function words in Catalan. This approach offers a quantitative and objective tool for researchers and linguists to gain insights into the structure and dynamics of languages, contributing to a deeper understanding of their underlying principles. The application of semanticity to Catalan is a promising pilot study, with potential applications in other languages, which will allow progress to be made in the field of theoretical linguistics and contribute to the development of automated linguistic tools.

Probing the content of affective semantic memory following caregiving-related early adversity

Article

Apr 2024
DEVELOPMENTAL SCI

Cognitive science has demonstrated that we construct knowledge about the world by abstracting patterns from routinely encountered experiences and storing them as semantic memories. This preregistered study tested the hypothesis that caregiving‐related early adversities (crEAs) shape affective semantic memories to reflect the content of those adverse interpersonal‐affective experiences. We also tested the hypothesis that because affective semantic memories may continue to evolve in response to later‐occurring positive experiences, child‐perceived attachment security will inform their content. The sample comprised 160 children (ages 6–12 at Visit 1; 87F/73 M), 66% of whom experienced crEAs ( n = 105). At Visit 1, crEA exposure prior to study enrollment was operationalized as parental‐reports endorsing a history of crEAs (abuse/neglect, permanent/significant parent‐child separation); while child‐reports assessed concurrent attachment security. A false memory task was administered online ∼2.5 years later (Visit 2) to probe the content of affective semantic memories–specifically attachment schemas. Results showed that crEA exposure (vs. no exposure) was associated with a higher likelihood of falsely endorsing insecure (vs. secure) schema scenes. Attachment security moderated the association between crEA exposure and insecure schema‐based false recognition. Findings suggest that interpersonal‐affective semantic schemas include representations of parent‐child interactions that may capture the quality of one's own attachment experiences and that these representations shape how children remember attachment‐relevant narrative events. Findings are also consistent with the hypothesis that these affective semantic memories can be modified by later experiences. Moving forward, the approach taken in this study provides a means of operationalizing Bowlby's notion of internal working models within a cognitive neuroscience framework. Research Highlights Affective semantic memories representing insecure schema knowledge ( child needs + needs‐not‐met ) may be more salient, elaborated, and persistent among youths exposed to early caregiving adversity. All youths, irrespective of early caregiving adversity exposure, may possess affective semantic memories that represent knowledge of secure schemas ( child needs + needs‐met ). Establishing secure relationships with parents following early‐occurring caregiving adversity may attenuate the expression of insecure semantic memories, suggesting potential malleability. Affective semantic memories include schema representations of parent‐child interactions that may capture the quality of one's own attachment experiences and shape how youths remember attachment‐relevant events.

Effect of age of first exposure on L2 contextual lexical semantic learning: an ERP investigation

Article

Full-text available

May 2024

Age of first exposure (AoFE) is an important factor that influences the quality of L2 acquisition. This study aims to investigate the AoFE effect on the contextual learning of L2 novel words at the neural level, as measured by the N400 component from event-related potentials (ERPs). Eighty-eight participants were recruited for the experiment of L2 pseudoword learning, which includes a learning session and a testing session. The participants’ EEG data were recorded from the testing session, and the N400 effect was derived from target words that were either congruous or incongruous with the context. The linear mixed model and multiple regression model revealed a positive AoFE effect on the N400 effect in discourses that were designed for testing retrieval of episodic and semantic memory even after accounting for the variance contributed by several confounding factors. In addition to AoFE, the effects of total L2 exposure, L2 proficiency and personality on the L2 novel word learning performance indicated by the N400 effect were also confirmed in the statistical results.

Advances in Semantic Dementia: Neuropsychology, Pathology & Neuroimaging

Article

Jun 2024
AGEING RES REV

What Can Language Models Tell Us About Human Cognition?

Article

Apr 2024

Language models are a rapidly developing field of artificial intelligence with enormous potential to improve our understanding of human cognition. However, many popular language models are cognitively implausible on multiple fronts. For language models to offer plausible insights into human cognitive processing, they should implement a transparent and cognitively plausible learning mechanism, train on a quantity of text that is achievable in a human’s lifetime of language exposure, and not assume to represent all of word meaning. When care is taken to create plausible language models within these constraints, they can be a powerful tool in uncovering the nature and scope of how language shapes semantic knowledge. The distributional relationships between words, which humans represent in memory as linguistic distributional knowledge, allow people to represent and process semantic information flexibly, robustly, and efficiently.

Relationship between Semantic Memory and Social Cognition in Schizophrenia

Preprint

Full-text available

May 2024

This study examines the relationship between semantic memory and social cognition in schizophrenia, addressing how these cognitive domains intersect. Semantic memory, which includes general world knowledge and word meanings, was evaluated using verbal fluency tasks and the Camels and Cactus Test. Social cognition, essential for social interaction, was assessed through emotion recognition (Faces Test) and Theory of Mind (Hinting Task). Participants included 50 individuals with schizophrenia and 30 controls. The schizophrenia group showed significantly lower performance on both semantic memory and social cognition tasks. Notably, strong correlations were found between the Camels and Cactus Test and social cognition measures, suggesting that social cognition deficits in schizophrenia may be linked to semantic memory impairments. Regression analyses highlighted that the Camels and Cactus Test significantly predicted social cognition performance, independent of symptomatology. These findings underscore the interconnectedness of semantic memory and social cognition in schizophrenia, suggesting that semantic memory deficits, particularly in non-categorical associations, play a important role in social cognitive impairments. This study provides new insights into the cognitive underpinnings of schizophrenia, emphasizing the need for further research to explore these relationships and their implications for cognitive models and therapeutic interventions.

The Role of Long-Term Memory in Visual Perception

Chapter

May 2024

There has been a long-standing debate in philosophy and psychology about the role of representation in visual perception. Here, we argue based on evidence from philosophy, psychology, and neuroscience that episodic and schematic memory representations are pivotal to the visual perception of objects and scenes. In the visual perception of objects and scenes, sensory information is initially matched with object and scene templates, or schemas, in long-term memory. The most relevant representations are then selected for encoding in working memory. We furthermore argue that activations of episodic memory representations contribute to the fineness of grain of visual representations. The representational view of visual perception that emerges is what we call the “template tuning view.” According to this view, prior information –specifically, long-term memories – shape the representational content of visual perception. In the final section of the chapter, we argue that unlike representational conceptions of visual perception, naïve and direct realist theories have difficulties accommodating these findings.

Words do not just label concepts: activating superordinate categories through labels, lists, and definitions

Article

May 2024

Achieving Balance Between Innovation and Security in the Cloud With Artificial Intelligence of Things: Semantic Web Control Models

Chapter

Mar 2024

This chapter explores the integration of Semantic Web control models, innovation, and security in cloud computing, especially in the context of AIoT integration. The Semantic Web provides machine-understandable data and offers sophisticated control models that enhance innovation and security in cloud environments. Technologies like RDF, OWL, and SPARQL enable semantic interoperability, while control models focus on access control mechanisms and authentication strategies. The chapter introduces the concept of AIoT, integrating AI with IoT devices and discusses the potential of Semantic Web control models in managing security risks and fostering innovation.

The Acquisition of Word Meaning through Global Lexical Co-occurrences

Conference Paper

Full-text available

Jan 2000

Curt Burgess

Structured Event Memory: A Neuro-Symbolic Model of Event Cognition

Article

Full-text available

Apr 2020

Humans spontaneously organize a continuous experience into discrete events and use the learned structure of these events to generalize and organize memory. We introduce the Structured Event Memory (SEM) model of event cognition, which accounts for human abilities in event segmentation, memory, and generalization. SEM is derived from a probabilistic generative model of event dynamics defined over structured symbolic scenes. By embedding symbolic scene representations in a vector space and parametrizing the scene dynamics in this continuous space, SEM combines the advantages of structured and neural network approaches to high-level cognition. Using probabilistic reasoning over this generative model, SEM can infer event boundaries, learn event schemata, and use event knowledge to reconstruct past experience. We show that SEM can scale up to high-dimensional input spaces, producing human-like event segmentation for naturalistic video data, and accounts for a wide array of memory phenomena. (PsycInfo Database Record (c) 2020 APA, all rights reserved).

Predicting High-Level Human Judgment Across Diverse Behavioral Domains

Article

Full-text available

Oct 2019

Recent advances in machine learning, combined with the increased availability of large natural language datasets, have made it possible to uncover semantic representations that characterize what people know about and associate with a wide range of objects and concepts. In this paper, we examine the power of word embeddings, a popular approach for uncovering semantic representations, for studying high-level human judgment. Word embeddings are typically applied to linguistic and semantic tasks, however we show that word embeddings can be used to predict complex theoretically- and practically- relevant human perceptions and evaluations in domains as diverse as social cognition, health behavior, risk perception, organizational behavior, and marketing. By learning mappings from word embeddings directly onto judgment ratings, we outperform a similarity-based baseline and perform favorably compared to common metrics of human inter-rater reliability. Word embeddings are also able to identify the concepts that are most associated with observed perceptions and evaluations, and can thus shed light on the psychological substrates of judgment. Overall, we provide new methods and insights for predicting and understanding high-level human judgment, with important applications across the social and behavioral sciences.

What Does BERT Learn about the Structure of Language?

Conference Paper

Full-text available

Jan 2019

Cooperation and Codenames: Understanding Natural Language Processing via Codenames

Article

Oct 2019

Codenames – a board game by Vlaada Chvátil – is a game that requires deep, multi-modal language understanding. One player, the codemaster, gives a clue to another set of players, the guessers, and the guessers must determine which of 25 possible words on the board correspond to the clue. The nature of the game requires understanding language in a multi-modal manner – e.g., the clue ‘cold’ could refer to temperature or disease. The recently proposed Codenames AI Competition seeks to advance natural language processing, by using Codenames as a testbed for multi-modal language understanding. In this work, we evaluate a number of different natural language processing techniques (ranging from neural approaches to classical knowledge-base methods) in the context of the Codenames AI framework, attempting to determine how different approaches perform. The agents are evaluated when working with identical agents, as well as evaluated with all other approaches – i.e., when they have no knowledge about their partner.

Finding structure in time

Chapter

Feb 2020

Jeffrey L. Elman

The spatial arrangement method of measuring similarity can capture high-dimensional semantic structures

Article

Feb 2020
BEHAV RES METHODS

Psychologists collect similarity data to study a variety of phenomena including categorization, generalization and discrimination, and representation itself. However, collecting similarity judgments between all pairs of items in a set is expensive, spurring development of techniques like the Spatial Arrangement Method (SpAM; Goldstone, Behavior Research Methods, Instruments, & Computers, 26, 381–386, 1994), wherein participants place items on a two-dimensional plane such that proximity reflects perceived similarity. While SpAM greatly hastens similarity measurement, and has been successfully used for lower-dimensional, perceptual stimuli, its suitability for higher-dimensional, conceptual stimuli is less understood. In study 1, we evaluated the ability of SpAM to capture the semantic structure of eight different categories composed of 20–30 words each. First, SpAM distances correlated strongly (r = .71) with pairwise similarity judgments, although below SpAM and pairwise judgment split-half reliabilities (r’s > .9). Second, a cross-validation exercise with multidimensional scaling fits at increasing latent dimensionalities suggested that aggregated SpAM data favored higher (> 2) dimensional solutions for seven of the eight categories explored here. Third, split-half reliability of SpAM dissimilarities was high (Pearson r = .90), while the average correlation between pairs of participants was low (r = .15), suggesting that when different participants focus on different pairs of stimulus dimensions, reliable high-dimensional aggregate similarity data is recoverable. In study 2, we show that SpAM can recover the Big Five factor space of personality trait adjectives, and that cross-validation favors a four- or five-dimension solution on this dataset. We conclude that SpAM is an accurate and reliable method of measuring similarity for high-dimensional items like words. We publicly release our data for researchers.

HellaSwag: Can a Machine Really Finish Your Sentence?

Conference Paper