GPT-3: Waterloo or Rubicon?
Here be Dragons
A Working Paper • Version 4.1 • May 7, 2022
William L. Benzon
GPT-3 is a significant achievement.
But I fear the community that has created it may, like other
communities have done before (machine translation in the mid-1960s,
symbolic computing in the mid-1980s), triumphantly walk
over the edge of a cliff and find itself standing proudly in mid-air.
This is not necessary and certainly not inevitable.
A great deal has been written about GPTs and transformers more generally, both in
the technical literature and in commentary of various levels of sophistication. I have
read only a small portion of this. But nothing I have read indicates any interest in the
nature of language or mind. Interest seems relegated to the GPT engine itself. And yet
the product of that engine, a language model, is opaque. I believe that, if we are to
move to a level of accomplishment beyond what has been exhibited to date, we must
understand what that engine is doing so that we may gain control over it. We must
think about the nature of language and of the mind.
That is what this working paper sets out to achieve: a beginning point, and only that.
By attending to ideas from Graham Neubig, Julian Michael, and Sydney Lamb, and by
extending them through the geometric semantics of Peter Gärdenfors, we can create a
framework in which to understand language and mind, a framework that is
commensurate with the operations of GPT-3. That framework can help us to
understand what GPT-3 is doing when it constructs a language model, and thereby to
gain control over that model so we can enhance and extend it.
It is in that speculative spirit that I offer the following remarks.
Bill Benzon, August 20, 2020
About the cover image: I created the background pattern in December, 1985, using MacPaint
running on a 1984 128K Apple Macintosh. I used a current version of Adobe’s Photoshop to overlay the
aleph.
GPT-3: Waterloo or Rubicon? Here be Dragons
William Benzon
Version 4.1, May 7, 2022
Abstract: GPT-3 is an AI engine that generates text in response to a prompt
given to it by a human user. It does not understand the language that it produces,
at least not as philosophers understand such things. And yet its output is in many
cases astonishingly like human language. How is this possible? Think of the mind
as a high-dimensional space of signifieds, that is, meaning-bearing elements.
Correlatively, text consists of one-dimensional strings of signifiers, that is, linguistic
forms. GPT-3 creates a language model by examining the distances and ordering
of signifiers in a collection of text strings and computing over them so as to reverse
engineer the trajectories texts take through that space. Peter Gärdenfors’ semantic
geometry provides a way of thinking about the dimensionality of mental space
and the multiplicity of phenomena in the world, about how mind mirrors the
world. Yet artificial systems are limited by the fact that they do not have a
sensorimotor system that has evolved over millions of years. They do have
inherent limits.
Contents
0. Starting point and preview ........................................................................................................... 1
1. Computers are strange beasts ....................................................................................................... 4
2. No meaning, no how: GPT-3 as Rubicon and Waterloo, a personal view .................................. 7
3. The brain, the mind, and GPT-3: An “isometric transform” onto meaning space ................... 15
4. Why is simple arithmetic difficult for deep learning systems? .................................................... 20
5. Metaphysics: The dimensionality of mind and world ................................................................ 22
6. Gestalt switch: GPT-3 as a model of the mind ........................................................................... 28
7. Engineered intelligence at liberty in the world ........................................................................... 30
Appendix: Semanticity, adhesion and relationality ........................................................................ 34
1301 Washington St., Apt. 311
Hoboken, NJ 07030
646.599.3232
bbenzon@mindspring.com
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.
0. Starting point and preview
GPT-3 is based on distributional semantics. Warren Weaver had the basic idea in his 1949
memorandum, “Translation” (p. 7). Gerard Salton operationalized the idea in his work using
vector semantics for document retrieval in the 1960s and 1970s (p. 8). Since then distributional
semantics has developed as an empirical discipline. The last decade of work in NLP has seen
remarkable, even astonishing, progress. And yet we lack a robust theoretical framework in which
we can understand and explain that progress. Such a framework must also indicate the inherent
limitations of distributional semantics. This document is a first attempt to outline such a
framework; as such, its various formulations must be seen as speculative and provisional. I offer
them so that others may modify them, replace them, and move beyond them.
It started with a comment at a blog
On July 19, 2020, Tyler Cowen made a post to Marginal Revolution entitled “GPT-3, etc.”1 It
consisted of an email from a reader who asserted, “When future AI textbooks are written, I could
easily imagine them citing 2020 or 2021 as years when preliminary AGI first emerged. This is
very different than my own previous personal forecasts for AGI emerging in something like 20-50
years…” While I have my doubts about the concept of AGI (it’s too ill-defined to serve as anything other than a hook on which to hang dreams, anxieties, and fears), I think GPT-3 is
worth serious consideration.
Cowen’s post has attracted 52 comments so far, more than a few of acceptable or even high
quality. I made a long comment to that post. I then decided to expand that comment into a series
of blog posts, say three or four, and then to collect them into a single document as a working
paper. When it appeared that those three or four posts would grow to five or six I decided that I
would issue two working papers. This first one would concentrate on GPT-3 and the nature of
artificial intelligence, or whatever it is. The later ones would speculate about the future and take a
quick tour of the past.
Here is a slightly revised version of the comment I made at Marginal Revolution. This paper
covers the shaded material. The rest may be covered in later papers.2
Yes, GPT-3 [may] be a game changer. But to get there from here we need to rethink a
lot of things. And where that's going (that is, where I think it best should go) is more
than I can do in a comment.
Right now, we're doing it wrong, headed in the wrong direction. AGI, a really good
one, isn't going to be what we're imagining it to be, e.g. the Star Trek computer.
Think AI as platform, not feature (Andreessen).3 Obvious implication: the basic
computer will be an AI-as-platform. Every human will get their own as a very young
child. They’ll grow with it; it’ll grow with them. The child will care for it as with a pet.
Hence we have ethical obligations to them. As the child grows, so does the pet; the pet
will likely have to migrate to other physical platforms from time to time.
1
Tyler Cowen, GPT-3, Marginal Revolution, blog post, July 19, 2020,
https://marginalrevolution.com/marginalrevolution/2020/07/gpt-3-etc.html.
2
None of them have been written as of April 26, 2022.
3
Is AI a feature or a platform? [machine learning, artificial neural nets], New Savanna, blog post,
December 13, 2019, https://new-savanna.blogspot.com/2019/12/is-ai-feature-or-platfrom-machine.html.
Machine learning was the key breakthrough. Rodney Brooks’ Genghis, with its
subsumption architecture, was a key development as well, for it was directed at robots
moving about in the world. FWIW Brooks has teamed up with Gary Marcus and they
think we need to add some old school symbolic computing into the mix. I think they’re
right.
Machines, however, have a hard time learning the natural world as humans do. We're
born primed to deal with that world with millions of years of evolutionary history
behind us. Machines, alas, are a blank slate.
The native environment for computers is, of course, the computational environment.
That's where to apply machine learning. Note that writing code is one of GPT-3's skills.
So, the AGI of the future, let's call it GPT-42, will be looking in two directions, toward
the world of computers and toward the human world. It will be learning in both, but in
different styles and to different ends. In its interaction with other artificial computational
entities GPT-42 is in its native milieu. In its interaction with us, well, we'll necessarily be
in the driver’s seat.
Where are we with respect to the hockey stick growth curve? For the last three-quarters of
a century, since the end of WWII, we've been moving horizontally, along a plateau,
developing tech. GPT-3 is one signal that we've reached the toe of the next curve. But
to move up the curve, as I’ve said, we have to rethink the whole shebang.
We're IN the Singularity. Here be dragons.
[Superintelligent computers emerging out of the FOOM is bullshit.]
* * * * *
ADDENDUM: A friend of mine, David Porush, has reminded me that Neal
Stephenson has written of such a tutor in The Diamond Age: Or, A Young Lady's Illustrated
Primer (1995).4 I then remembered that I have played the role of such a tutor in real life,
The Freedoniad: A Tale of Epic Adventure in which Two BFFs Travel the Universe
and End up in Dunkirk, New York.5
* * * * *
Here’s what is in the rest of this paper:
1. Computers are strange beasts. They’re obviously inanimate, and yet we communicate
with them through language. They don’t fit pre-existing (19th century?) conceptual categories, and
so we are prone to strange views about them.
4
The Diamond Age, Wikipedia: https://en.wikipedia.org/wiki/The_Diamond_Age.
5
The Freedoniad: A Tale of Epic Adventure in which Two BFFs Travel the Universe and End up in
Dunkirk, New York, New Savanna, blog post, February 12, 2019, http://new-
savanna.blogspot.com/2014/10/the-freedoniad-tale-of-epic-adventure.html.
2. No meaning, no how: GPT-3 as Rubicon and Waterloo, a personal view. Arguing
from first principles it is clear that GPT-3 lacks understanding and access to meaning. And yet it
produces very convincing simulacra of understanding. But common sense understanding remains
elusive, as it did for old school symbolic processing. Much of common sense is deeply embedded
in the physical world. GPT-3, as it currently functions, is, in effect, an artificial brain in a vat.
3. The brain, the mind, and GPT-3: Dimensions and conceptual spaces. GPT-3
creates a language model by examining the distances and ordering of signifiers in a collection of
text strings and computing over them so as to reverse engineer the trajectories texts take through
a high-dimensional mental space of signifieds. Peter Gärdenfors’ semantic geometry provides a way
of thinking about the dimensionality of mental space and the multiplicity of phenomena in the
world.
4. Why is simple arithmetic difficult for deep learning systems? This difficulty
suggests that such systems are unable to distinguish between episodic and semantic memory, a distinction
introduced in the previous section.
5. Metaphysics: The dimensionality of mind and world. I suggest that large-scale
language models, such as GPT-3, give evidence of what I am calling the metaphysical structure of
the world, which is a function of how human minds categorize things.
6. Gestalt switch: GPT-3 as a model of the mind. GPT-3 creates: 1) a model of a body of
natural language texts, and only a model. 2) Those texts are the product of human minds. 3)
Through the application of 2 to 1 we may conclude that GPT-3 is also a model of the mind, albeit
a very limited one. Point 3 requires a Gestalt switch.
7. Engineered intelligence at liberty in the world. The “intelligence” in systems such as
GPT-3 is static and reactive. To liberate and mobilize it we need to endow AI systems with
mental models of the kind investigated in “old school” symbolic AI.
Appendix: Semanticity, adhesion and relationality: I pick up from the passage in section
7 where I distinguish between a relational aspect and an intentional aspect of meaning. I now
distinguish between intentionality and semanticity, where semanticity consists of relationality and
adhesion.
1. Computers are strange beasts
The purpose of this working paper and the next is to set out a vision for the evolution of artificial
intelligence beyond GPT-3 (GPT: Generative Pre-trained Transformer). As I explain in the next
section, “No meaning, no how”, GPT-3 is both a remarkable achievement (we are now at sea in
the Singularity and there is no turning back) and a temptation to continue with what has worked
so far. Thus the title of this paper suggests both possibilities, GPT-3: Waterloo or Rubicon?
Here be dragons. No doubt some are already yielding to temptation and itching to build more
of the same, but others are actively resisting that impulse, and have been for a while. What will
happen?
Of course I don’t know what will happen, but I have preferences.
The purpose of this series is to lay out those preferences. In the next section I quote extensively
from an article David Hays and I published in 1990, the first in a series of essays in which we
outlined a view of human cultural evolution over the longue durée. I conclude with some
observations on the value of being old: you’ve had plenty of failure from which to recover.
Beyond AGI
In “The Evolution of Cognition”6 David Hays and I argued that the long-term evolution of
human culture flows from the architectural foundations of thought and communication: first
speech, then writing, followed by systematized calculation, and most recently, computation. In
discussing the importance of the computer we remark:
One of the problems we have with the computer is deciding what kind of thing it is, and
therefore what sorts of tasks are suitable to it. The computer is ontologically ambiguous.
Can it think, or only calculate? Is it a brain or only a machine?
The steam locomotive, the so-called iron horse, posed a similar problem for people at
Rank 3. It is obviously a mechanism and it is inherently inanimate. Yet it is capable of
autonomous motion, something heretofore only within the capacity of animals and
humans. So, is it animate or not? Perhaps the key to acceptance of the iron horse was
the adoption of a system of thought that permits separation of autonomous motion from
autonomous decision. The iron horse is fearsome only if it may, at any time, choose to
leave the tracks and come after you like a charging rhinoceros. Once the system of
thought had shaken down in such a way that autonomous motion did not imply the
capacity for decision, people made peace with the locomotive.
The computer is similarly ambiguous. It is clearly an inanimate machine. Yet we
interact with it through language; a medium heretofore restricted to communication
with other people. To be sure, computer languages are very restricted, but they are
languages. They have words, punctuation marks, and syntactic rules. To learn to
program computers we must extend our mechanisms for natural language.
As a consequence it is easy for many people to think of computers as people. Thus
Joseph Weizenbaum, with considerable dis-ease and guilt, tells of discovering that his
6
William L. Benzon and David G. Hays, The Evolution of Cognition, Journal of Social and Biological Structures
13(4): 297-320, 1990, https://www.academia.edu/243486/The_Evolution_of_Cognition.
secretary “consults” Eliza (a simple program which mimics the responses of a
psychotherapist) as though she were interacting with a real person (Weizenbaum
1976). Beyond this, there are researchers who think it inevitable that computers will
surpass human intelligence and some who think that, at some time, it will be possible for
people to achieve a peculiar kind of immortality by “downloading” their minds to a
computer. As far as we can tell such speculation has no ground in either current
practice or theory. It is projective fantasy, projection made easy, perhaps inevitable, by
the ontological ambiguity of the computer. We still do, and forever will, put souls into
things we cannot understand, and project onto them our own hostility and sexuality,
and so forth.
A game of chess between a computer program and a human master is just as
profoundly silly as a race between a horse-drawn stagecoach and a train. But the
silliness is hard to see at the time. At the time it seems necessary to establish a purpose
for humankind by asserting that we have capacities that it does not. It is truly difficult to
give up the notion that one has to add “because . . . “ to the assertion “I’m important.”
But the evolution of technology will eventually invalidate any claim that follows
“because.” Sooner or later we will create a technology capable of doing what,
heretofore, only we could.
That is where we are now. The notion of an AGI (artificial general intelligence) that will bootstrap
itself into superintelligence is fantasy; it arises because, even after three-quarters of a century,
computers are still strange to us. We design, build, and operate them; but they challenge us;
they’ve got bugs, they crash, they don’t come to heel when we command. We don’t know what
they are. That is certainly the case with GPT-3. We’ve built it; its performance amazes (but
puzzles and disappoints as well). And we do not understand how it works. It is almost as puzzling
to us as we are to ourselves. Surely we can change that, no?
We conclude our essay with this paragraph:
We know that children can learn to program, that they enjoy doing so, and that a
suitable programming environment helps them to learn (Kay 1977, Papert 1980).
Seymour Papert argues that programming allows children to master abstract concepts
at an earlier age. In general it seems obvious to us that a generation of 20-year-olds who
have been programming computers since they were 4 or 5 years old are going to think
differently than we do. Most of what they have learned they will have learned from us.
But they will have learned it in a different way. Their ontology will be different from
ours. Concepts which tax our abilities may be routine for them, just as the calculus,
which taxed the abilities of Leibniz and Newton, is routine for us. These children will
have learned to learn Rank 4 concepts.
Frankly, I think we were ahead of the curve on this one. Had Hays and I hazarded to predict the
advance of computing into the lives of children (“The child is Father to the man”, as
Wordsworth observed), I fear we would be disappointed by the current situation.
Yes, relatively young programmers have done remarkable things and Silicon Valley teems with
young virtuosi. It is not the virtuosi I’m concerned about. It is the average, which is too low, by
far.
Oddly enough, the current pandemic may help raise that average, though only marginally. With
at-home schooling looming in the future, school districts are beginning to buy laptop machines for
children whose families cannot afford them. For without those machines, those children will not
be able to participate in the only education available to them. No doubt most of the instruction
they receive through those machines will train them to be only passive consumers of computation,
as most of us are and have been conditioned to be.
But some of them surely will be curious. They’ll take a look under the virtual hood (though some
of them will undoubtedly open up the physical machine itself, not that there’s much to see, with so
much action integrated on a single chip) and begin tinkering around. And before you know it,
they’ll do interesting things and Peter Thiel is going to be handing out more of those $100,000
fellowships to teens living in institutionally impoverished neighborhoods plagued with
substandard infrastructure.7

We’ll see.
My intellectual history
I was trained in computational semantics by the late David Hays, a first-generation researcher in
machine translation and one of the founders of computational linguistics.8 He saw the collapse of
that enterprise in the mid-1960s because it over-promised and under-delivered. He learned from
that collapse. But of course, I could not. For me it is just something that had happened in the past.
I could listen to the lessons Hays had taken from those events, and believe them, but those lessons
weren’t my lessons. I did not have to adjust my priors to accommodate to that collapse.
Symbolic AI, roughly similar to the computational semantics I learned from Hays, collapsed in
the mid-1980s. I had fully expected to see the development of symbolic systems capable of
“reading” a Shakespeare play in an intellectually interesting way.9 That was not to be. I have
made other adjustments, in response to other events, since then. I have NOT kept to a straight
and narrow path. My road has been a winding one.
But I have kept moving.
I have always believed that you should commit yourself to the strongest intellectual position you
can, but not in the expectation that it will pan out or that it is your duty to make it pan out come
hell or high water. No, you do it because it maximizes your ability to learn from what you got
wrong. If you don’t establish firm priors, you can’t correct them effectively.
My intellectual career has thus been a long sequence of error-correcting maneuvers. Have I got it
right at long last?
Are you crazy?
This section and the ones to follow are no more than my best assessment of the current situation,
subject to the fact that I’m doing this quickly. I will surely be wrong in many particulars, and
perhaps in overall direction as well. Consider this paper to be a set of Bayesian priors subject to
correction by later events.
7
Peter Thiel offers $100,000 fellowships to talented young people provided they drop out of college so they
can do new things, https://thielfellowship.org/.
8
David G. Hays, Wikipedia, https://en.wikipedia.org/wiki/David_G._Hays.
9
See the discussion of the Prospero system on pages 271-273 of William Benzon and David G. Hays,
Computational Linguistics and the Humanist, Computers and the Humanities, Vol. 10. 1976, pp. 265-274,
https://www.academia.edu/1334653/Computational_Linguistics_and_the_Humanist.
2. No meaning, no how: GPT-3 as Rubicon and Waterloo, a
personal view
I say that not merely because I am a person and, as such, I have a point of view on GPT-3 and
related matters. I say so because this discussion is informal, without journal-class examination of
this, that, and the other, along with the attendant burden of citation, though I will offer some
citations. I am trying to figure out just what it is that I think, and see value in doing so in public.
What value, you ask? It commits me to certain ideas, if only provisionally. It lays out a set of priors
and thus serves to sharpen my ideas as developments unfold and I reconsider and adjust.
GPT-3 represents an achievement of a high order; it deserves the attention it has received, if not
the hype. We are now deep in “here be dragons” territory and we cannot go back. And yet, if we
are not careful, we’ll never leave the dragons, we’ll always be wild and undisciplined. We will
never actually advance; we’ll just spin faster and faster. Hence GPT-3 is both a Rubicon, the
crossing of a threshold, and a potential Waterloo, a battle the AI community cannot win. If it
chooses to fight, that is, to continue with the largely empirical methods that have brought success
so far, it will lose as machine translation did in the mid-1960s and symbolic AI did in the mid-
1980s. By all means, continue the building and experimenting. But take a few moments to step
back and reflect about the enterprise and so develop a deeper understanding of what has been
done in the past and what can and should be done in the future.
Here’s my plan for this section of the paper: First we take a look at history, at the origins of
machine translation and symbolic AI. Next I develop a fairly standard critique of models such as
those used in GPT-3 and follow it with similar remarks by Martin Kay, one of the Grand Old
Men of computational linguistics. Then I look at the problem of common-sense reasoning and
conclude by looking ahead to the next stage of exposition in which I offer some speculations on
why (and perhaps even how) these models can succeed despite their severe and fundamental
shortcomings.
Background: MT and Symbolic computing
It all began with a famous memo Warren Weaver wrote in 1949. Weaver was director of the
Natural Sciences division of the Rockefeller Foundation from 1932 to 1955. He collaborated with
Claude Shannon on the publication of a book which popularized Shannon’s seminal work in
information theory, The Mathematical Theory of Communication. Weaver’s 1949 memorandum, simply
entitled “Translation”,10 is regarded as the catalytic document in the origin of machine translation
(MT) and hence of computational linguistics (CL) and (heck, why not?) artificial intelligence (AI).
Let’s skip to the fifth section of Weaver’s memo, “Meaning and Context” (p. 8):
First, let us think of a way in which the problem of multiple meaning can, in principle at
least, be solved. If one examines the words in a book, one at a time as through an
opaque mask with a hole in it one word wide, then it is obviously impossible to
determine, one at a time, the meaning of the words. “Fast” may mean “rapid”; or it
may mean “motionless”; and there is no way of telling which.
10
Warren Weaver, “Translation”, Carlsbad, NM, July 15, 1949, 12 pp. Online: http://www.mt-
archive.info/Weaver-1949.pdf.
But if one lengthens the slit in the opaque mask, until one can see not only the central
word in question, but also say N words on either side, then if N is large enough one can
unambiguously decide the meaning of the central word. The formal truth of this
statement becomes clear when one mentions that the middle word of a whole article or
a whole book is unambiguous if one has read the whole article or book, providing of
course that the article or book is sufficiently well written to communicate at all.
It wasn’t until the 1960s and ‘70s that computer scientists would make use of this insight; Gerard
Salton was the central figure and he was interested in document retrieval.11 Salton would
represent documents as a vector of words and then query a database of such representations by
using a vector composed from user input. Documents were retrieved as a function of similarity
between the input query vector and stored document vectors.
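A minimal sketch of that style of retrieval, with obvious simplifications (raw word counts, no term weighting, invented documents and query):

```python
# Salton-style retrieval in miniature: documents and queries become word-count
# vectors, and documents are ranked by cosine similarity to the query vector.
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "d1": "the boat overturned and Smith drowned in the lake",
    "d2": "the committee met to discuss the budget for next year",
}
query = vectorize("drowning accident on the lake")
ranked = sorted(docs, key=lambda d: cosine(query, vectorize(docs[d])), reverse=True)
print(ranked)  # d1, the drowning story, ranks first
```

Real systems weight the terms (tf-idf and the like), but the geometric picture, meaning treated as position in a vector space, is already present.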
Work on MT went a different way. Various approaches were used, but at some relatively early
point researchers were writing formal grammars of languages. In some cases these grammars were
engineering conveniences while in others they were taken to represent the mental grammars of
humans. In any event, that enterprise fell apart in the mid-1960s. The prospects for practical
results could not justify federal funding and the government had little interest in supporting purely
scientific research into the nature of language.
Such research continued nonetheless, sometimes under the rubric of computational linguistics
(CL) and sometimes as AI. I encountered CL in graduate school in the mid-1970s when I joined
the research group of David Hays in the Linguistics Department of the State University of New
York at Buffalo (I was actually enrolled as a graduate student in English).
Many different semantic models were developed in various research groups, but we don’t need
anything like a review of that work, just a little taste. In particular let us look at a general type of
model known as a semantic or cognitive network. Hays had been developing such a model for
some years in conjunction with several graduate students.12 Here’s a fragment of a network from a
system developed by one of those students, Brian Philips, to tell whether or not newspaper stories
of people drowning were tragic.13 Here’s a representation of the semantics of capsize:
11
David Durbin, The Most Influential Paper Gerard Salton Never Wrote, Library Trends, vol. 52, No. 4,
Spring 2004, pp. 748-764.
12
For a basic account of cognitive networks, see David G. Hays. Networks, Cognitive. In (Allen
Kent, Harold Lancour, Jay E. Daily, eds.): Encyclopedia of Library and Information Science, Vol. 19.
Marcel Dekker, Inc., NY 1976, 281-300. You can download a copy here,
https://www.academia.edu/10900362/Networks_Cognitive.
13
Brian Phillips. A Model for Knowledge and Its Application to Discourse Analysis, American Journal of
Computational Linguistics, Microfiche 82, (1979).
Notice that there are two kinds of nodes in the network, square ones and round ones. The square
ones represent a scene while the round ones represent individual objects or events. Thus the
square node at the upper left indicates a scene with two sub-scenes (I’m just going to follow out
the logic of the network without explaining it in any detail). The first one asserts that there is a boat
that contains one Horatio Smith. The second one asserts that the boat overturns. And so forth through
the rest of the diagram.
This network represents semantic structure. In the terminology of semiotics, it represents a
network of signifieds. Though Philips didn’t do so, it would be entirely possible to link such a
semantic network with a syntactic network (composed of signifiers), and many systems of that era
did so.
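For concreteness, here is a rough sketch of such a network fragment as a data structure. The node names, node kinds, and relation labels are stand-ins of my own, not Philips’ actual notation:

```python
# Sketch of the kind of network described above: "scene" nodes grouping
# "entity" and "event" nodes, with labeled edges. Names and relations invented.

nodes = {
    "S1":       {"kind": "scene"},   # the overall episode
    "S1a":      {"kind": "scene"},   # sub-scene: Smith is in the boat
    "S1b":      {"kind": "scene"},   # sub-scene: the boat overturns
    "boat":     {"kind": "entity"},
    "smith":    {"kind": "entity"},  # one Horatio Smith
    "overturn": {"kind": "event"},
}

edges = [
    ("S1",  "S1a",      "has-part"),
    ("S1",  "S1b",      "has-part"),
    ("S1a", "boat",     "participant"),
    ("S1a", "smith",    "contained-in"),
    ("S1b", "boat",     "participant"),
    ("S1b", "overturn", "event"),
]

def neighbors(node):
    """Everything a node is directly related to. In Lamb's sense (section 3),
    a node's 'meaning' is its position in the whole graph, not its label."""
    return [(rel, dst) for src, dst, rel in edges if src == node]

print(neighbors("S1a"))
```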
Such networks were symbolic in the (obvious) sense that the objects in them were considered to be
symbols, not sense perceptions or motor actions nor, for that matter, neurons, whether real or
artificial. The relationship between such systems and the human brain was not explored, either in
theory or in experimental observation. It wasn’t an issue.
That enterprise collapsed in the mid-1980s. Why? The models had to be hand-coded, which took
time. They were computationally expensive and so-called common sense reasoning proved to be
endless, making the models larger and larger. (I discuss common sense below and I have many
posts at New Savanna on the topic.14)
The work didn’t stop entirely. Some researchers kept at it. But interests shifted toward machine
learning techniques and toward artificial neural networks. That is the line of evolution that has,
three or four decades later, resulted in systems like GPT-3, which also owe a debt to the vector
semantics pioneered by Salton. Such systems build huge language models from huge corpora
(GPT-3 is based on 300 billion tokens)15 and contain no explicit models of syntax or semantics
anywhere, at least none that researchers can recognize.
14
My various posts on common sense are at this link: https://new-
savanna.blogspot.com/search/label/common%20sense%20knowledge.
15
Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language Models are Few-Shot Learners,
arXiv:2005.14165v4 [cs.CL] 5 June 2020, p. 8. (https://arxiv.org/abs/2005.14165v4)
Researchers build a computational system that constructs a language model (“learns” the
language), but the inner workings of that model are opaque to the researchers. The system built
the model, not the researchers. They only built the system.
It is a strange situation.
No words, only signifiers
Let us start with the basics. There are no words as such in the corpus on which GPT-3’s language
model is based. Words have spelling, pronunciation, meaning, often various meanings,
grammatical usage, and connotations and implications. All that exists in the corpus are spellings,
bare naked signifiers. No signifieds, that is to say, semantics and more generally concepts and even
percepts. And certainly no referents, things and situations in the world to which words refer. The
corpus is thus utterly empty of signification, and yet it exhibits order, and the structure in GPT-3’s
language model derives from that order.
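The point is easy to make concrete. A toy word-level tokenizer (GPT-3 actually uses a byte-pair encoding, but the moral is the same) reduces a corpus to bare integer IDs:

```python
# As far as the machine is concerned, a corpus is a sequence of arbitrary IDs.
# Nothing about meaning, pronunciation, or reference rides along with them.

corpus = "the boat overturned and Smith drowned in the lake"

vocab, ids = {}, []
for word in corpus.split():
    if word not in vocab:
        vocab[word] = len(vocab)   # assign the next unused integer
    ids.append(vocab[word])

print(ids)  # [0, 1, 2, 3, 4, 5, 6, 0, 7]
# Swap which integer stands for which spelling and nothing detectable changes;
# only the pattern of repetitions and orderings remains. That pattern is all a
# language model has to work with.
```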
Remember, the corpus consists of words people wrote while negotiating their way in the world. In
their heads they’ve got a highly structured model of the world, a semantics (which rides on
perception and cognition). Let us say, for the moment, that the semantic model is
multidimensional. Linguistic syntax maps that multidimensional semantics onto a one-
dimensional string which can be communicated through speech, gesture, or writing.
GPT-3 has access only to those strings. It ‘knows’ nothing of the world, nor of syntax, much less of
semantics, cognition and perception. What it is ‘examining’, in those strings, however, reflects the
interaction of human minds and the world. While there is an obvious sense in which the structure in
those strings comes from the human mind, we also have to take the world into account. For the
people who created those strings were not just spinning language out for the fun of it (oh, some of them
were). But even poets, novelists, and playwrights attend to the world’s structure in their writing.
What GPT-3 recovers and constructs from the data it ingests is thus a simulacrum of the
interaction between people and the world. There is no meaning there. Only entanglement. And
yet what it does with that entanglement is very interesting and has profound consequences, but
more of that later.
* * * * *
Now, we as users, as clients of such systems, are fooled by our semiotic naiveté. Even when we’ve
taken semiotics 101, we look at a written signifier and we take it for a word, automatically and
without thought, with its various meanings and implications. But it isn’t a word, not really.
Yes, in normal circumstances (talking with one another, reading various documents) it makes
sense for us to treat signifiers as words. As such those signifiers are linked to signifieds (semantics,
concepts, percepts) and referents (things and situations in the world). But output from GPT-3 is
not normal circumstances. It’s working from a huge corpus of signifiers, but no matter how you
bend, fold, spindle, or mutilate those signifiers, you’re not going to get a scintilla of meaning. Any
meaning you see is meaning you put there.
Where did those signifiers come from? That’s right, those millions if not billions of people writing
away. Writing about the world. So there is in fact something about the world intertwined with
those signifiers, just as there is something about the structure of the minds that composed them.
The structures of minds and of the world have become entangled and projected onto one
freakishly long and entangled pile of strings. That is what GPT-3 works with to generate its
language model.
Let me repeat this once again, obvious though it is: Those words in the corpus were generated by
people conveying knowledge of, attempting to make sense of, the world. Those strings are coupled
with the world, albeit asynchronously. Without that coupling, that corpus would collapse into an
unordered pile of bare naked signifiers. It is that coupling with the world that authorizes our
treatment of those signifiers as full-on words.
We need to be clear on the distinction between the language system as it exists in the minds of
people and the many and various texts those people generate as they employ that system to
communicate about and make sense of the world. It would be a mistake to think that the GPT-3
language model is only about what is inside people’s heads. It is also about the world, for those
people use what is in their heads to negotiate their way in the world. [I intend to “cash out” on
my insistence on this point in the next section.]
Martin Kay, “an ignorance model”
With that in mind let us consider what Martin Kay has to say about statistical language
processing. Martin Kay is one of the Grand Old Men of computational linguistics. He was
originally trained in Great Britain by Margaret Masterman, a student of Ludwig Wittgenstein,
and moved to the United States in the 1950s, where he worked with my teacher and colleague,
David Hays. Before he came to SUNY Buffalo, Hays had run the RAND Corporation’s
program in machine translation.
In the early 2000s the Association for Computational Linguistics gave Kay a lifetime achievement
award and he delivered some remarks on that occasion.16 At the end he says (p. 438):
Statistical NLP has opened the road to applications, funding, and respectability for our
field. I wish it well. I think it is a great enterprise, despite what I may have seemed to say
to the contrary.
Prior to that he had this to say (p. 437):
Symbolic language processing is highly nondeterministic and often delivers large
numbers of alternative results because it has no means of resolving the ambiguities that
characterize ordinary language. This is for the clear and obvious reason that the
resolution of ambiguities is not a linguistic matter. After a responsible job has been done
of linguistic analysis, what remain are questions about the world. They are questions of
what would be a reasonable thing to say under the given circumstances, what it would
be reasonable to believe, suspect, fear, or desire in the given situation. If these questions
are in the purview of any academic discipline, it is presumably artificial intelligence. But
artificial intelligence has a lot on its plate and to attempt to fill the void that it leaves
open, in whatever way comes to hand, is entirely reasonable and proper. But it is
important to understand what we are doing when we do this and to calibrate our
expectations accordingly. What we are doing is to allow statistics over words that occur
very close to one another in a string to stand in for the world construed widely, so as to
include myths, and beliefs, and cultures, and truths and lies and so forth. As a stop-gap
for the time being, this may be as good as we can do, but we should clearly have only
16
Martin Kay, A Life of Language, Computational Linguistics, Volume 31 Issue 4, December 2005, pp. 425-
438, http://web.stanford.edu/~mjkay/LifeOfLanguage.pdf.
the most limited expectations of it because, for the purpose it is intended to serve, it is
clearly pathetically inadequate. The statistics are standing in for a vast number of things
for which we have no computer model. They are therefore what I call an “ignorance
model.”
An “ignorance model”: that, more or less, is what I said in the previous section when I said that
GPT-3 had no access to semantics.
There are some thinkers who believe it is necessary to somehow integrate these statistical methods
with the “old school” symbolic methods pioneered by Kay and by David Hays and many others.
That’s what I think, and have thought so for some time. I know that Gary Marcus and Rodney
Brooks have a start-up built on that premise, Robust AI.17 So does David Ferrucci, who headed
IBM’s Watson project when it won at Jeopardy; Elemental Cognition is his start-up.18
Common sense reasoning about the world
As I’ve already observed, one of the problems that sank symbolic processing in the mid-1980s is
that of common sense reasoning. This has become an issue with current work as well. While
GPT-3 does exhibit impressive performance in this area, there are signs that not all is as it seems.
Kevin Lacker observes:19
Traditionally, artificial intelligence struggles at “common sense”. But GPT-3 can
answer a lot of common-sense questions.
Q: What is your favorite animal?
A: My favorite animal is a dog.
Q: Why?
A: Because dogs are loyal and friendly.
Q: What are two reasons that a dog might be in a bad mood?
A: Two reasons that a dog might be in a bad mood are if it
is hungry or if it is hot.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many legs does a frog have?
A: A frog has four legs.
Q: Are there any animals with three legs?
A: No, there are no animals with three legs.
Q: Why don't animals have three legs?
A: Animals don't have three legs because they would fall
over.
Ten years ago, if I had this conversation, I would have assumed the entity on the other
end was a human. You can no longer take it for granted that an AI does not know the
answer to “common sense” questions.
Yes. But Lacker goes on:
We need to ask questions that no normal human would ever talk about.
17
Robust AI, https://www.robust.ai/.
18
Elemental Cognition, https://www.elementalcognition.com/.
19
Kevin Lacker's blog, Giving GPT-3 a Turing Test, https://lacker.io/ai/2020/07/06/giving-gpt-3-a-
turing-test.html.
Q: How many eyes does a giraffe have?
A: A giraffe has two eyes.
Q: How many eyes does my foot have?
A: Your foot has two eyes.
Q: How many eyes does a spider have?
A: A spider has eight eyes.
Q: How many eyes does the sun have?
A: The sun has one eye.
Q: How many eyes does a blade of grass have?
A: A blade of grass has one eye.
Now we’re getting into surreal territory. GPT-3 knows how to have a normal
conversation. It doesn’t quite know how to say “Wait a moment... your question is
nonsense.” It also doesn’t know how to say “I don’t know.”
It seems that, if you probe a bit, GPT-3 does have problems with common sense reasoning.
A lot of common-sense reasoning takes place “close” to the physical world. I have come to
believe, but will not here argue, that much of our basic (‘common sense’) knowledge of the
physical world is grounded in analogue and quasi-analogue representations.20 This gives us the
power to generate language about such matters on the fly. Old school symbolic machines did not
have this capacity, nor do current statistical models, such as GPT-3.
But then how can a system generate analog or quasi-analog representations of the world unless it
has direct access to the world? The creators of GPT-3 acknowledge this as a limitation:
Finally, large pretrained language models are not grounded in other domains of
experience, such as video or real-world physical interaction, and thus lack a large
amount of context about the world [BHT+20]. For all these reasons, scaling pure self-
supervised prediction is likely to hit limits, and augmentation with a different approach
is likely to be necessary. Promising future directions in this vein might include learning
the objective function from humans [ZSW+19a], fine-tuning with reinforcement
learning, or adding additional modalities such as images to provide grounding and a
better model of the world [CLY+19].21
And yet GPT-3 seems so effective. How can that be?
The above critique is from first principles and, as such, seems to me to be unassailable. Equally
unassailable, however, are the facts on the ground: these systems do work. And here I’m not
talking only about GPT-3 and its immediate predecessors. I’ve done much of my thinking about
these matters in connection with other kinds of systems based on distributional semantics, such as
topic modeling, for one example.
20
For a superb analog model see William Powers, Behavior: The Control of Perception (Aldine) 1973. Don’t let
the publication date fool you; Powers develops his model with a simplicity and elegance that makes it well worth
our attention even now, almost 50 years later. Hays integrated Powers’ model into his cognitive network
model, see David G. Hays, Cognitive Structures, HRAF Press, 1981. Also, see my post, “Computation, Mind,
and the World [bounding AI]”, New Savanna, blog post, December 28, 2019, https://new-
savanna.blogspot.com/2019/12/computation-mind-and-world-bounding-ai.html.
21
Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language Models are Few-Shot Learners,
arXiv:2005.14165v4 [cs.CL] 5 June 2020, p. 34. (https://arxiv.org/abs/2005.14165v4)
Thus I have little choice, it seems, but to hazard an account of just why these models are effective.
That’s my task for the next section, “The brain, the mind, and GPT-3: Dimensions and
conceptual spaces”. Note that I do not mean to explicate the computational processes used in
GPT-3, not at all. Rather, I am going to speculate about what there is in the nature of the mind,
and perhaps even of the world, that allows such mechanisms to succeed.
It is common to think of language as loose, fuzzy, and imprecise. And so it is. But that cannot and
is not all there is to language. In order for language to work at all there must be a rigid and
inflexible aspect to it. That is what I’ll be talking about in the next section. I’ll be building on
theoretical work by Sydney Lamb, Peter Gärdenfors, and a comment Graham Neubig made in a
discussion about semantics and machine learning.
3. The brain, the mind, and GPT-3: An “isometric
transform” onto meaning space
The purpose of this section is to sketch a conceptual framework in which we can understand the
success of language models such as GPT-3 despite the fact that they are based on nothing more
than massive collections of unadorned signifiers. I have no intention of attempting to explain how
GPT-3 works. That it does work, in an astonishing variety of cases if (certainly) not universally, is
sufficient for my purposes.
First of all I present the insight that sent me down this path, a comment by Graham Neubig in an
online conversation that I was not a part of. Then I set that insight in the context of an insight by
Sydney Lamb (meaning resides in relations), a first-generation researcher in machine translation
and computational linguistics. I then take a grounding case from Julian Michael, that of color, and
suggest that it can be extended by the work of Peter Gärdenfors on conceptual spaces.
A clue: an isomorphic transform into meaning space
At the 58th Annual Meeting of the Association for Computational Linguistics Emily M. Bender
and Alexander Koller delivered a paper, “Climbing towards NLU: On Meaning, Form, and
Understanding in the Age of Data”,22 where NLU means natural language understanding. The
issue is pretty much the one I laid out in the previous section, in “No words, only
signifiers” and “Martin Kay, ‘an ignorance model’”. A lively discussion ensued online, which
Julian Michael has summarized and commented on in a recent blog post.23
In that post Michael quotes a remark by Graham Neubig:24
One thing from the twitter thread that it doesn’t seem made it into the paper... is the
idea of how pre-training on form might learn something like an “isomorphic transform”
onto meaning space. In other words, it will make it much easier to ground form to
meaning with a minimal amount of grounding. There are also concrete ways to
measure this, e.g. through work by Lena Voita or Dani Yogatama... This actually seems
like an important point to me, and saying “training only on form cannot surface
meaning,” while true, might be a little bit too harshsomething like “training on form
makes it easier to surface meaning, but at least a little bit of grounding is necessary to do
so” may be a bit more fair.
That’s my point of departure in this section, that notion of “an ‘isomorphic transform’ onto
meaning space.” I am going to sketch a framework in which we can begin unpacking that idea.
But it may take a while to get there.
22
Emily M. Bender and Alexander Koller, Climbing towards NLU: On Meaning, Form, and
Understanding in the Age of Data, Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 5185-5198, July 5-10, 2020.
23
Julian Michael, To Dissect an Octopus: Making Sense of the Form/Meaning Debate,
https://blog.julianmichael.org/2020/07/23/to-dissect-an-
octopus.html?fbclid=IwAR0LfzVkrmiMBggkm0tJyTN8hgZks5bN0b5Wg4MO96GWZBx9Fom
qhIJH4LQ.
24
He’s an Associate Professor at Carnegie Mellon. His website, http://www.phontron.com/.
Meaning is in relations
I want to develop an idea I have from Sydney Lamb, that meaning resides in relations. The idea
as Lamb understood it emerged in the “old school” world of symbolic computation, where
language is conceived as a relational network of items. The meaning of any item in the network is
a function of its position in the network. Note that this means that I am assuming that the mind is
constituted, in part, by something like an old school symbol model; that’s not all that constitutes
the mind, not by any means (recall p. 13). It is that symbolic model that is the object of Neubig’s
“isometric transform”.
Let’s start with this simple diagram:
It represents the fact that the central nervous system (CNS) is coupled to two worlds, each
external to it. To the left we have the external world. The CNS is aware of that world through
various senses (vision, hearing, smell, touch, taste, and perhaps others) and we act in that world
through the motor system. But the CNS is also coupled to the internal milieu, with which it shares
a physical body. The net is aware of that milieu by chemical sensors indicating contents of the
blood stream and of the lungs, and by sensors in the joints and muscles. And it acts in the world
through control of the endocrine system and the smooth muscles. Roughly speaking the CNS
guides the organism’s actions in the external world so as to preserve the integrity of the internal
milieu. When that integrity is gone, the organism is dead.
Now consider this more differentiated presentation of the same facts:
I have divided the CNS into four sections: A) senses the external world, B) senses the internal
milieu, C) guides action in the internal milieu, and D) guides action in the external world. I rather
doubt that even a very simple animal, such as C. elegans, with 302 neurons, is so simple. But I trust
my point will survive that oversimplification.
Lamb’s point is that the “meaning” or “significance” of any of those nodes (let’s not worry at the
moment whether they’re physical neurons or more abstract entities) is a function of its position
in the entire network, with its inputs from and outputs to the external world and the inner
milieu.25 To appreciate the full force of Lamb’s point we need to recall the diagrams typical of old
school symbolic computing, such as this diagram from Brian Philips we saw in the previous
section:
All of the nodes and edges have labels. Lamb’s point is that those labels exist for our convenience;
they aren’t actually a part of the system itself. If we think of that network as a fragment from a
human cognitive system (and I’m pretty sure that’s how Philips thought about it, even if he could
not justify it in detail; no one could, not then, not now), then it is ultimately connected to both
the external world and the inner milieu. All those labels fall away; they serve no purpose. Alas,
Philips was not building a sophisticated robot, and so those labels are necessary fictions.
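A small sketch of the point about labels, using a made-up toy graph: strip or swap the names and each node is still individuated by its pattern of connections.

```python
# Lamb's point in miniature: the labels are for our convenience. Keep only the
# pattern of connections and the nodes are still distinguishable by position.

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def degree_profile(edge_list):
    """Characterize each node purely by how many connections it has."""
    profile = {}
    for a, b in edge_list:
        profile[a] = profile.get(a, 0) + 1
        profile[b] = profile.get(b, 0) + 1
    return profile

# Relabel every node arbitrarily; the structure, and hence each node's
# 'position', is unchanged. Only the names differ.
relabel = {"A": "n1", "B": "n2", "C": "n3", "D": "n4"}
renamed = [(relabel[a], relabel[b]) for a, b in edges]

print(sorted(degree_profile(edges).values()))    # [1, 2, 2, 3]
print(sorted(degree_profile(renamed).values()))  # [1, 2, 2, 3] -- same structure
```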
But we’re interested in the real case, a human being making their way in the world. In that case
let us assume that, for one thing, the necessary diagram is WAY more complex, and that the
nodes and edges do not represent individual neurons. Rather, they represent various entities that
are implemented in neurons: sensations, thoughts, perceptions, and so forth. Just how such things
are realized in neural structures is a matter of some importance and is being pursued by hundreds
of thousands of investigators around the world. But we need not worry about that now. We’re
about to fry some rather more abstract fish.
Some of those nodes will represent signifiers, to use the Saussurian terminology I used earlier, and
some will represent signifieds. What’s the difference between a signifier and a signified?
25
Sydney Lamb, Pathways of the Brain, John Benjamins, 1999. See also Lamb’s most recent account
of his language model, Sydney M. Lamb, Linguistic Structure: A Plausible Theory, Language Under
Discussion, 2016, 4(1): 1-37. https://journals.helsinki.fi/lud/article/view/229.
Their position in the network as a whole. That’s all. No more, no less. Now, it seems to me, we can
begin thinking about Neubig’s “isomorphic transform” onto meaning space.
Let us notice, first of all, that language exists as strings of signifiers in the external world. In the
case that interests us, those are strings of written characters that have been encoded into
computer-readable form. Let us assume that the signifieds (which bear a major portion of
meaning, no?) exist in some high-dimensional network in mental space. This is, of course, an
abstract space rather than the physical space of neurons, which is necessarily three dimensional.
However many dimensions this mental space has, each signified exists at some point. Just how this
conceptual space is implemented in populations of neurons is a matter of considerable interest,
but we need not consider that here.26
What happens when you write? You produce a string of signifiers. The distance between signifiers
on this string, and their ordering relative to one another, are a function of the relative distances
and orientations of their associated signifieds in mental space. Perhaps that’s where to look for
Neubig’s isometric transform into meaning space. What GPT-3, and other NLP engines, does is to
examine the distances and ordering of signifiers in the string and compute over them so as to reverse engineer the
distances, orientations and relations of the associated signifieds in high-dimensional mental space.
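That claim is speculative, and GPT-3’s transformer is certainly not doing what follows, but the general family of distributional methods gives the flavor. A toy sketch over an invented corpus: each word gets a distance-weighted co-occurrence profile, and words with similar neighbors come out near one another.

```python
# Toy distributional sketch (not GPT-3's mechanism): estimate relative positions
# in 'meaning space' from the distances and orderings of signifiers in strings.
import math
from collections import defaultdict

corpus = ("the cat chased the mouse . the dog chased the cat . "
          "the dog ate the bone . the cat ate the mouse .").split()

cooc = defaultdict(lambda: defaultdict(float))
window = 2
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[w][corpus[j]] += 1.0 / abs(i - j)   # nearer neighbors count more

def similarity(u, v):
    keys = set(cooc[u]) | set(cooc[v])
    dot = sum(cooc[u][k] * cooc[v][k] for k in keys)
    nu = math.sqrt(sum(x * x for x in cooc[u].values()))
    nv = math.sqrt(sum(x * x for x in cooc[v].values()))
    return dot / (nu * nv)

print(similarity("cat", "dog"), similarity("cat", "bone"))
```

Here “cat” and “dog” share contexts (chased, ate, the) and so come out more similar than “cat” and “bone”. Scaled up enormously, and made predictive rather than merely counted, this is something like the territory in which GPT-3’s training operates.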
Let us recall the classic distinction between semantic and episodic memory. Roughly speaking
semantic memory is like a dictionary; it is a basic inventory of concepts. Episodic memory, first
characterized, I believe, by Endel Tulving,27 is more like an encyclopedia attached to histories
and news reports. Semantic memory28 is an inventory of types while entries in episodic memory
consist of tokens of those types. The square nodes in the Brian Philips network diagram represent
episodes while the circles represent semantic entities (p. 17).
What is the size of human semantic memory? One estimate places the number of English words
at roughly a million.29 No one individual will know all those words, but the corpus GPT-3 is based
on is certainly not the product of a single individual. Let us assume, then, for the sake of argument,
that GPT-3, or a similar engine, includes roughly one million word types (I believe that GPT-3’s
language model is in fact based on fewer than 100,000 words). Or, to be more precise, a million
signifier types, and they in turn correspond to a million signified types in mental space. We
certainly don’t need 175 billion parameters to characterize those million signifieds. We need them
to characterize all the episodes in the text base.
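A back-of-envelope check on that claim. The figures below are assumptions for the sake of argument, except the two noted as reported for GPT-3 (a vocabulary of roughly 50,000 tokens and 12,288-dimensional embeddings for the largest model, as I read the published description):

```python
# Rough arithmetic only; the dimensionalities are assumptions, not facts about
# how GPT-3 represents anything internally.

total_parameters = 175_000_000_000

# Generous case: one million signified types, each a 1,000-dimensional vector.
lexicon = 1_000_000 * 1_000
print(lexicon / total_parameters)     # ~0.006 -- well under 1% of the budget

# GPT-3's own token embeddings, using the reported vocabulary and width.
embeddings = 50_000 * 12_288
print(embeddings / total_parameters)  # ~0.0035

# Either way, characterizing the lexicon is a rounding error. On the argument
# here, the bulk of the parameters must be characterizing the episodes.
```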
Imagine some space adequate to characterize our million signifieds. Maybe it has 10, 100, or
more dimensions; as long as the signifieds are usefully differentiated from one another, just how
that is done is secondary. Each signified occupies a point in that space.
26
I offer some preliminary speculation in William Benzon, Attractor Nets, Series I: Notes Toward a New Theory of
Mind, Logic, and Dynamics in Relational Networks, Working Paper, 52 pp.,
https://www.academia.edu/9012847/Attractor_Nets_Series_I_Notes_Toward_a_New_Theory_of_Mind
_Logic_and_Dynamics_in_Relational_Networks.
27
Endel Tulving, Episodic and semantic memory, in: Endel Tulving and Wayne Donaldson (Eds.),
Organization of Memory (Academic Press, New York, 1972) 382 - 403.
28
Semantic memory is sometimes said to constitute an ontology. However, I suspect that human semantic
memory is not so orderly as the ontologies proposed by knowledge engineers. See Wikipedia’s entry,
Ontology (information science), https://en.wikipedia.org/wiki/Ontology_(information_science), and John
Sowa’s Ontology page, http://www.jfsowa.com/ontology/.
29
“How Many Words Are There In The English Language?” Dictionary.com, accessed August 1,
2020, https://www.dictionary.com/e/how-many-words-in-english/.
An episode, then, is a path or a trajectory in the space.30 An episode might be only a single token, or ten, thirty, or 256 tokens; but
it might also have 100,000 or more tokens. Those 175 billion parameters are characterizing those
episodes. When you present GPT-3 with a prompt, it treats that prompt as the initial segment of
an episode and continues that trajectory to complete the episode.
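For readers who find a data-structure analogy useful, here is one way to picture the type/token distinction and episodes-as-paths in code. It is a toy sketch with invented names, vocabulary, and dimensions, and it claims nothing about how GPT-3 actually stores anything.

    from dataclasses import dataclass
    import numpy as np

    # Semantic memory: an inventory of types, one point per signified.
    rng = np.random.default_rng(0)
    semantic_memory = {w: rng.normal(size=50) for w in ["dog", "bird", "salt", "chase", "taste"]}

    @dataclass
    class Episode:
        """Episodic memory entry: an ordered run of tokens, each token
        pointing back at a type in semantic memory."""
        tokens: list    # e.g. ["dog", "chase", "bird"]

        def trajectory(self):
            # The episode as a path: the sequence of points visited in the space.
            return np.stack([semantic_memory[t] for t in self.tokens])

    episodic_memory = [
        Episode(["dog", "chase", "bird"]),
        Episode(["salt", "taste"]),
    ]

    # A prompt is treated as the initial segment of such a path,
    # to be continued until the episode is complete.
    print(episodic_memory[0].trajectory().shape)    # (3, 50)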
In this framework we can think of the extended meaning of a semantic type as being a function of
the way its tokens participate in episode strings. I say extended because it is also defined in that base
space that distinguishes the types from one another. This extended meaning has a Wittgensteinian
feel: word meanings reside in their use.
However, even that extended meaning is not entirely adequate, as some of Kevin Lacker’s
common-sense examples suggest (p. 12 above). In the real world many basic concepts are
grounded in sensory-motor schemas of one kind or another, the image of a dog, the sound of a bird,
or the taste of salt. GPT-3 doesn’t have access to such schemas. Some of that information,
however, is characterized by sentences and phrases, that is, by episodes, and GPT-3 does have
access to those.
Many things, moreover, are characterized in multiple ways. Ordinary table salt is characterized
by its taste, appearance, and haptic feel. To the chemist, however, table salt consists mostly of
sodium chloride (NaCl), along with traces of various impurities. Conceptually sodium chloride
didn't even exist until the nineteenth century. While the substance is concrete, the
conceptualization is abstract. The same is true for dogs and cats and pine trees and grasses.
Young children recognize them by how they look, sound, feel, smell, and taste. The professional biologist, however, has altogether more abstract ways of characterizing them. And so it goes for the entirety of the natural world and the sciences that have arisen to study those phenomena. We live amid multiple overlapping ontologies. [31]
What does GPT-3 “know” of such things? We could, I suppose, ask, couldn’t we?
30
I’ve explored the notion of texts as paths in semantic space, William Benzon, Virtual Reading: The Prospero
Project Redux, Working Paper, Version 2, October 2018, 37 pp.,
https://www.academia.edu/34551243/Virtual_Reading_The_Prospero_Project_Redux.
31
I’ve written a bit about multiple ontologies. See William Benzon, Ontology of Common Sense, Hans
Burkhardt and Barry Smith, eds. Handbook of Metaphysics and Ontology, Muenchen: Philosophia
Verlag GmbH, 1991, pp. 159-161. The final draft is online,
https://www.academia.edu/28723042/Ontology_of_Common_Sense; Ontology in Knowledge Representation for
CIM, Center for Manufacturing Productivity and Technology Transfer, Rensselaer Polytechnic Institute.
Report No. CIMNW85TR034, January 1985,
https://www.academia.edu/19804747/Ontology_in_Knowledge_Representation_for_CIM.
4. Why is simple arithmetic difficult for deep learning
systems?
Video: Gary Marcus - Towards a Proper Foundation for Artificial General Intelligence:
https://youtu.be/8VWQQbngxXY
Gary Marcus makes this point twice in the video: c. 18:25 (multiplication of 2-digit numbers), c. 19:49 (3-digit addition). Why is this so difficult for deep learning models to grasp? This suggests a failure to distinguish between semantic and episodic memory, to use terms from Old School symbolic computation that I introduced in the previous section.
The question interests me because arithmetic calculation has well-understood procedures. We
know how people do it. And by that I mean that there’s nothing important about the process that’s
hidden, unlike our use of ordinary language. The mechanisms of both sentence-level grammar
and discourse structure are unconscious.
It's pretty clear to me that arithmetic requires episodic structure, to introduce a term from old symbolic-systems AI and computational linguistics. That's obvious from the fact that we don't teach it to children until grammar school, which is roughly when episodic-level cognition kicks in (see the paper Hays and I did on natural intelligence [32]).
I note that, while arithmetic is simple, it’s simple only in that there are no subtle conceptual issues
involved. But fluency requires years of drill. First the child must learn to count; that gives numbers
meaning. Once that is well in hand, children are drilled in arithmetic tables for the elementary
operations: addition, subtraction, multiplication, and division. The learning of addition and
subtraction tables proceeds along with exercises in counting, adding and subtracting items in
collections. Once this is going smoothly one learns the procedures for multiple-digit addition and subtraction, multiple-operand addition, and then multiplication and division. Multiple-digit
division is the most difficult because it requires guessing, which is then checked by actual
calculation (multiplication followed by subtraction).
Why do such intellectually simple procedures require so much drill? Because each individual
step must be correct. One mistake anywhere, and the whole calculation is thrown off. You need to
recall atomic facts (from the tables) many times in a given calculation and keep track of
intermediate results. The human mind is not well-suited to that. It doesn’t come naturally. Drill is
required. That drill is being managed by episodic cognition.
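To see how much step-keeping is involved, here is the grade-school procedure for multi-digit addition written out in Python. It is a sketch of the symbolic procedure children drill, offered for contrast; it says nothing about what a neural network does internally.

    # Every step recalls an "atomic fact" from a memorized table and
    # keeps track of a carried intermediate result.
    ADDITION_TABLE = {(a, b): a + b for a in range(10) for b in range(10)}

    def column_add(x: str, y: str) -> str:
        n = max(len(x), len(y))
        x, y = x.zfill(n), y.zfill(n)
        carry, digits = 0, []
        for a, b in zip(reversed(x), reversed(y)):       # rightmost column first
            s = ADDITION_TABLE[(int(a), int(b))] + carry  # recall a table fact
            digits.append(str(s % 10))                    # write down this column's digit
            carry = s // 10                               # remember the carry
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    assert column_add("457", "968") == "1425"
    # One wrong lookup or one dropped carry anywhere and the whole answer
    # is wrong, which is why fluency takes years of drill.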
It would seem that GPT-3 cannot pick up that kind of episodic structure. The question is: Can it
pick up any kind of episodic structure at all? I don’t know.
When humans produce the kind of coherent prose that GPT-3 does, they are using episodic
cognition. But that episodic cognition is unconscious. Does GPT-3 pick up episodic cognition of
32
William Benzon and David G. Hays, A Note on Why Natural Selection Leads to Complexity, Journal of
Social and Biological Structures 13: 33-40, 1990,
https://www.academia.edu/8488872/A_Note_on_Why_Natural_Selection_Leads_to_Complexity.
that kind? As I say, I don’t know. But I can imagine that it does not. If not, then what is GPT-3
doing to produce such convincing simulacra of coherent prose? I am tempted to say it is doing it
all with systemic-level cognition, but that may be a mistake as well. GPT-3 is doing it with some
other mechanism, one that doesn't differentiate between the semantic and episodic levels.
5. Metaphysics: The dimensionality of mind and world
Let’s bring this down to earth. Let’s return to Bender and Koller, who proposed a thought
experiment involving a superintelligent octopus listening in on a conversation between two
people. Julian Michael proposes the following:
As a concrete example, consider an extension to the octopus test concerning color, a grounded concept if there ever was one. Suppose our octopus O is still underwater, and he:
- Understands where all color words lie on a spectrum from light to dark... But he doesn't know what light or dark mean.
- Understands where all color words lie on a spectrum from warm to cool... But he doesn't understand what warm or cool mean.
- Understands where all color words lie on a spectrum of saturated to washed out... But he doesn't understand what saturated or washed-out mean.
Et cetera, for however many scalar concepts you think are necessary to span color space with sufficient fidelity. A while after interposing on A and B, O gets fed up with his benthic, meaningless existence and decides to meet A face-to-face. He follows the cable to the surface, meets A, and asks her to demonstrate what it means for a color to be light, warm, saturated, etc., and similarly for their opposites. After grounding these words, it stands to reason that O can immediately ground all color terms, a much larger subset of his lexicon. He can now demonstrate full, meaningful use of words like green and lavender, even if he never saw them used in a grounded context. This raises the question: When, or from where, did O learn the meaning of the word "lavender"?
It’s hard for me to accept any answer other than “partly underwater, and partly on
land.” Bender acknowledges this issue in the chat as well:
The thing about language is that it is not unstructured or random, there is a lot of information there in
the patterns. So as soon as you can get a toe hold somewhere, then you can (in principle, though I don’t
want to say it’s easy or that such systems exist), combine the toe hold + the structure to get a long ways.
The thing about color is that it is much investigated and well (if not completely) understood, from
genetics up through cultural variation in color terms. And color is understood in terms of three
dimensions, hue (warm to cool), saturation, and brightness (light to dark).
And that brings us to the work of Peter Gärdenfors, who has developed a very sophisticated
geometry of conceptual spaces where each space is organized along one or more dimensions. [33]
And he means real geometry, not geometry as metaphor. He starts with color, but then, over the
course of two books, extends the idea of conceptual spaces and their constitutive dimensions to a
wide and satisfying range of examples.
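As a toy illustration of what such a geometric treatment of color might look like, loosely in the spirit of Gärdenfors's prototypes and convex regions, consider the following sketch in Python. The three dimensions and every prototype coordinate are invented for the example.

    import numpy as np

    # Color words as prototype points in a three-dimensional space:
    # (hue on a warm-to-cool scale, saturation, brightness), each scaled to [0, 1].
    prototypes = {
        "red":      np.array([0.05, 0.90, 0.50]),
        "orange":   np.array([0.15, 0.90, 0.60]),
        "green":    np.array([0.55, 0.80, 0.50]),
        "blue":     np.array([0.75, 0.80, 0.50]),
        "lavender": np.array([0.80, 0.30, 0.80]),
        "black":    np.array([0.50, 0.00, 0.00]),
        "white":    np.array([0.50, 0.00, 1.00]),
    }

    def name_color(point):
        # Nearest-prototype categorization; the regions this induces are the
        # sort of convex (Voronoi) cells Gardenfors uses to model concepts.
        return min(prototypes, key=lambda w: np.linalg.norm(prototypes[w] - point))

    # Once the three scales are grounded, any point, and hence any color word
    # tied to a point, is grounded too.
    print(name_color(np.array([0.78, 0.35, 0.75])))    # "lavender"

Once the octopus grounds the three scales, every point in the space comes along for free, which is Michael's point about "lavender".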
This is not the time and place to even attempt a précis of his theory. But I note, for example, that
he has interesting treatments of properties, animal concepts, metaphor, prepositions, induction,
and computation in Conceptual Spaces.
33
Peter Gärdenfors, Conceptual Spaces: The Geometry of Thought, MIT Press, 2000; The Geometry of Meaning:
Semantics Based on Conceptual Spaces, MIT Press, 2014.
His more recent book, The Geometry of Meaning, has chapters on semantic domains, meeting of minds (in interaction), the semantics of nouns, adjectives, and actions, and propositions and compositionality. As a starting point I recommend his recent article [34], which also contains some remarks about computational implementation. In our immediate context a crucial point is that Gärdenfors regards his account of mental
spaces as being different from both classic symbolic accounts of mind (such as that embodied by
the Brian Philips example) and artificial neural networks, such as GPT-3. Though I am perhaps
interpreting him a bit, he sees mental spaces as a tertium quid between the two. In particular, to the
extent that Gärdenfors is more or less correct, we have a coherent and explicit way of understanding the
success of neural network models such as GPT-3. That is, if the world, on the one hand, and the human
sensorium and motor system, on the other, are like that, then the success of GPT-3 is intelligible on
those terms.
The metaphysical structure of the world
It seems to me that what Gärdenfors is looking at is what we might call, for lack of a better term,
the metaphysical structure of the world.
The metaphysical structure of the world?!
I don’t mean physical structure of the world, which is a subject for the various physical and, I
suppose, biological sciences. I mean metaphysical. Just what that is, I’m not sure. The metaphysical
structure of the world is that structure that makes the world intelligible to us; it exists in the
relationship between us, Homo sapiens sapiens, and the world. What is the world that it is perceptible,
that we can move around in it in a coherent fashion? Whatever it is, it is the product of millions of
years of evolution in which animals have had to make their way in the world.
Imagine, in contrast, that the world consisted entirely of elliptically shaped objects. Some are
perfectly circular, others only nearly circular. Still others seem almost flattened into lines. And we
have everything in between. In this world things beneficial to us are a random selection from the
full population of possible elliptical beings, and the same with things dangerous to us. Thus there
are no simple and obvious perceptual cues that separate good things from bad things. A very good
elliptical being may differ from a very bad being in a very minor way, difficult to detect. Such a
world would be all but impossible to live in.
That is not the world we have. Yes, there are cases where small differences are critical. But they
don’t dominate. Our world is intelligible. Plants are distinctly different from animals, tigers from
mice, oaks from petunias, rocks and water are not at all alike, and so on. It is thus possible to
construct a conceptual system capable of navigating in the external world so as to preserve and
even enhance the integrity of the internal milieu. That, I believe, is what Gärdenfors is looking at
when he talks of dimensionality and conceptual spaces. Conceptual spaces capture the variety in
the world in a way that nervous systems can compute over it. The metaphysical structure of the
world thus lies in the correspondence of language with the world.
I am, at least provisionally, calling that correspondence the metaphysical structure of the world.
Moreover, since humans did not arise de novo that metaphysical structure must necessarily extend
through the animal kingdom and, who knows, plants as well.
34
Peter Gärdenfors, An Epigenetic Approach to Semantic Categories, IEEE Transactions on Cognitive and Developmental Systems, Volume 12, Issue 2, June 2020, 139-147. DOI: 10.1109/TCDS.2018.2833387
(sci-hub link, https://sci-hub.tw/10.1109/TCDS.2018.2833387)
“How”, you might ask, “does this metaphysical structure of the world differ from the world’s
physical structure?” I will say, again provisionally, that it is a matter of intension rather than extension.
Extensionally the physical and the metaphysical are one and the same. But intensionally, they are
different. We think about them in different terms. We ask different things of them. They have
different conceptual affordances. The physical world is meaningless; it is simply there. It is in the
metaphysical world that we seek meaning.
Interlude, a little dialog
Does this make sense, philosophically? How would I know?
I get it, you’re just making this up.
Right.
Hmmmm… How does this relate to that object-oriented ontology stuff you
were so interested in a couple of years ago? [35]
Interesting question. Why don’t you think about it and get back to me.
I mean, that metaphysical structure you’re talking about, it seems
almost like a complex multidimensional tissue binding the world
together. It has a whiff of a Latourian actor-network about it.
Hmmm… Set that aside for awhile. I want to go somewhere else.
Still on GPT-3, eh?
You got it.
World, Mind, and Text
Text reflects this learnable, this metaphysical, structure, albeit at some remove:
35
See, for example, William Benzon, Living with Abundance in a Pluralist Cosmos: Some Metaphysical Sketches,
Working Paper, January 2013, 87 pp.,
https://www.academia.edu/4066568/Living_with_Abundance_in_a_Pluralist_Cosmos_Some_Metaphysi
cal_Sketches.
Learning engines are learning the structure inherent in the text. But that learnable structure is not
explicit in the language model created by the learning engine.
There are two things in play: 1) the fact that the text is learnable, and 2) that it is learnable by a
statistical process. How are these two related?
If we already had an explicit ‘old school’ propositional model in computable form, then we
wouldn’t need statistical learning at all. We could just run the propositional model over the corpus
and encode the result. But why do even that? If we can read the corpus with the propositional
model, in a simulation of human reading, then there’s no need to encode it at all. Just read
whatever aspect of the corpus is needed at the time.
So, statistical learning is a substitute for the lack of a usable propositional model. The statistical
model does work, but at the expense of explicitness.
But why does the statistical model work at all? That’s the question.
It’s not enough to say, because the world itself is learnable. That’s true for the propositional
model as well. Both work because the world is learnable.
BUT: Humans don’t learn the world with a statistical model. We learn it through a propositional
engine floating over an analogue or quasi-analogue engine with statistical properties. And it is the
propositional engine that allows us to produce language. A corpus is a product of the action of
a propositional engine, not a statistical model, acting on the world.
Description is one basic such action; narration is another. Analysis and explanation are perhaps
more sophisticated and depend on (logically) prior description and narration. Note that this
process of rendering into language is inherently and necessarily a temporal one. The order in
which signifiers are placed into the speech stream depends in some way, not necessarily obvious, on
the relations among the correlative signifieds in semantic or cognitive space. Distances between
signifiers in the speech stream reflect distances between correlative signifieds in semantic space.
We thus have systematic relationships between positions and distances of signifiers in the speech stream, on the one
hand, and positions and distances of signifieds in semantic space. It is those systematic relationships that allow
statistical analysis of the speech stream to reconstruct semantic space.
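One could, at least in principle, measure how systematic those relationships are. Here is a crude sketch, assuming we already have vectors for the signifieds (say, from something like the co-occurrence example earlier); the function name, the gap cutoff, and the use of a simple correlation are my own choices for illustration.

    import numpy as np

    def stream_vs_space(sentences, vectors, max_gap=10):
        """Collect (distance in the speech stream, distance in semantic space)
        pairs for word pairs in each sentence, and report how correlated they are."""
        gaps, dists = [], []
        for s in sentences:
            words = [w for w in s.split() if w in vectors]
            for i in range(len(words)):
                for j in range(i + 1, min(len(words), i + 1 + max_gap)):
                    gaps.append(j - i)   # how far apart the signifiers sit in the string
                    dists.append(np.linalg.norm(vectors[words[i]] - vectors[words[j]]))
        return float(np.corrcoef(gaps, dists)[0, 1])   # a rough index of the systematic relationship

Whatever regularity such a measurement picks up is, on the argument above, exactly what a statistical learner has to exploit.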
Note that time is not extrinsic to this process. Time is intrinsic and constitutive of computation. Speaking
involves computation, for it is language, as does the statistical analysis of the speech stream.
The propositional engine learns the world via Gärdenfors’ dimensions, and whatever else,
Powers' stack for example. [36]
Those dimensions are implicit in the resulting propositional model
and so become projected onto the speech stream via syntax, pragmatics, and discourse structure.
The language engine is then able to extract (a simulacrum of) those dimensions through statistical
learning. Those dimensions are expressed in the parameter weights of the model. THAT’s what
makes the knowledge so ‘frozen’. One has to cue it with actual speech.
36
William Powers, Behavior: The Control of Perception (Aldine) 1973. A decade later David Hays integrated
Powers’ model into his cognitive network model, David G. Hays, Cognitive Structures, HRAF Press, 1981.
The whole language model thus functions as associative memory. [37]
You present it with an input
cue, and it then associates from that cue and emits tokens which then project back into the model,
cueing other tokens, and so forth.
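In procedural terms the cueing loop looks something like the following schematic sketch. This is not OpenAI's API; model.next_token_distribution is a hypothetical stand-in for whatever the trained network actually computes.

    import random

    def continue_episode(model, prompt_tokens, n_steps=50):
        """Associative-memory style generation: present a cue, emit a token,
        fold the token back into the cue, and repeat."""
        tokens = list(prompt_tokens)
        for _ in range(n_steps):
            # Hypothetical call: map the cue (everything so far) to a
            # probability distribution over possible next tokens.
            probs = model.next_token_distribution(tokens)
            choices, weights = zip(*probs.items())
            nxt = random.choices(choices, weights=weights, k=1)[0]
            tokens.append(nxt)          # the emitted token projects back into the cue
            if nxt == "<end>":
                break
        return tokens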
Waterloo or Rubicon? Beyond blind success
We now have a way of beginning to think about just why language models such as GPT-3 can be
so successful even though, as I argued earlier, they have no direct access to the realm of signifieds,
of meaning. This argument does not change that. Whatever GPT-3 accomplishes, it does so on
the strength of structural relationships between the signifiers which it accessed in the corpus and the
signifieds in the minds of the people who produced the texts in that corpus. Its success is a
triumph of formalism devoid of meaning. Meaning requires access to the world.
And yet, if meaning inheres in relationships, as Lamb has argued, then those relationships exist in
the model even as the model is isolated from the world. But we are creatures of language. We
generate it and it generates us. Is it so very strange that an elaborate network of relationships
among bare naked signifiers should evoke the tantalizing prospect of flesh and blood
interlocutors?
What happens next with GPT-3 and other such models? At the moment they represent the
success of not-so-blind groping extending back to Gerard Salton’s first experiments with vector
semantics. As far as I can tell the creators of such models have little commitment to some theory
of how language and mind work. They may well know that the neurons in their networks
resemble real neurons about as much as a smiley face resembles the Mona Lisa, that their layers
have only a passing resemblance to the structure of the cerebral cortex; they may well have taken
Linguistics 101, and so forth. But thinking in those terms is not central to their work; they do not
arrive at the computer with a well thought-out account of mind-brain-world-and-language. That’s
not what they’re trying to figure out. They’re trying to figure out how to get computers to
produce simulacra of human language and cognitive behavior. They do that very well. As for
their remarkable results, they don’t know how their engines achieve them. They know only that
they do. And they no doubt are full of ideas about how to modify those engines so they do better.
Remarkable as the results have been, I do not see this as a long-term strategy for success. Students
of symbolic systems know a lot about how they work, though many details are in dispute. How
can that knowledge be brought to bear on the construction and operation of (large-scale) statistical models of language? I have suggested a framework in which that can be done, a fairly specific suggestion about Neubig's isometric transform onto meaning space, to be amplified and extended with Gärdenfors' conceptual spaces. It is only a beginning. Such a framework would be useful in
looking under the hood to examine the mechanics of these models so that we can improve them.
When I talk of GPT-3 as a crossing of the Rubicon, that is what I mean. Given a way of thinking
about how such models operate, we are at the threshold of even more remarkable developments.
But if the AI community refuses to develop such a framework then I fear that their work will,
sooner or later, crash and burn, as machine translation did in the mid 1960s, and as symbolic
37
The idea that the brain implements associative memory in a holographic fashion was championed by
Karl Pribram in the 1970s and 1980s. David Hays and I drew on that work in an article on metaphor,
William Benzon and David Hays, Metaphor, Recognition, and Neural Process, The American Journal of
Semiotics , Vol. 5, No. 1 (1987), 59-80,
https://www.academia.edu/238608/Metaphor_Recognition_and_Neural_Process.
computation did in the mid 1980s. They will have met yet another Waterloo, snatching defeat, to change metaphors in mid-stream, from success.
We have no choice but to move forward.
Unless, of course, the investors chicken out.
6. Gestalt switch: GPT-3 as a model of the mind
Here are some key paragraphs from section three; note the underlined sections:
Let us notice, first of all, that language exists as strings of signifiers in the external world.
In the case that interests us, those are strings of written characters that have been
encoded into computer-readable form. Let us assume that the signifieds (which bear a major portion of meaning, no?) exist in some high dimensional network in mental
space. This is, of course, an abstract space rather than the physical space of neurons,
which is necessarily three dimensional. However many dimensions this mental space
has, each signified exists at some point in that space and, as such, we can specify that
point by a vector containing its value along each dimension.
What happens when one writes? Well, one produces a string of signifiers. The distance
between signifiers on this string, and their ordering relative to one another, are a function
of the relative distances and orientations of their associated signifieds in mental space.
That’s where to look for Neubig’s isometric transform into meaning space. What GPT-3,
and other NLP engines, does is to examine the distances and ordering of signifiers in the string and
compute over them so as to reverse engineer the distances and orientations of the associated signifieds in
high-dimensional mental space.
The purpose of this section is simply to underline the seriousness of my assertion to treat the mind
as a high-dimensional space and that, therefore, we should treat the high-dimensional parameter
space of GPT-3 as a model of the mind. If you aren't comfortable with the idea, well, it takes a bit
of time for it to settle down. This section is a way of occupying some of that time.
If it’s not a model of the mind, after all, then what IS it a model of? “The language”, you say?
Where does the language come from, where does it reside? "The mind", that's right.
It is certainly not a complete model of the mind. The mind, for example, is quite fluid, is capable
of autonomous action, has access to the physical world, and is deeply social. GPT-3 seems static
and is only reactive. It cannot initiate action, has no direct access to the external world, and has
little capacity for social interaction. Nonetheless, it is still a rich model.
I built plastic models as a kid, models of rockets, of people, and of sailing ships. None of those
models completely captured the things they modeled. I was quite clear on that. I have a cousin
who builds museum-class ship models from wood of various kinds, metal, cloth, paper, thread and
twine (and perhaps some plastic here and there). They are much more accurate and aesthetically
pleasing than the models I assembled from plastic kits as a kid. But they are still only models.
So it is with GPT-3. It is a model of the mind. We need to get used to thinking of it in those terms,
dangerous as they may be. But, really, can the field get more narcissistic and hubristic than it
already is?
* * * * *
This is not the first time I've been through this drill. I've been thinking about this, that, and the other in the so-called digital humanities since 2014. Call it computational criticism. These particular investigators had been using various kinds of distributional semantics (topic modeling, vector space semantics) to examine literary texts and populations of texts. They don't think about their language models as models of the mind; they're just, well, you know, language models, models of texts. There's some kind of membrane, some kind of barrier, that keeps us (them, me, you) from moving from these statistical models of texts to thinking of them as models of the mind that produced the texts. They're not the real thing, they're stop gaps, approximations.
Yes, they are. And they are also models, as much models of the mind as a plastic schooner is a
model of the America.
* * * * *
I'm suggesting we need to perform a gestalt switch. As long as we think of the statistical object as a poor cousin of what we're really interested in (language, the mind) we see it as a rabbit. But then it starts walking like a duck and quacking like one. Shazaam! It's a duck.
Why am I saying this? Like I said, to underline the seriousness of my assertion to treat the mind as a high-dimensional space. In a common formulation, the mind is what the brain does. The brain is a three-dimensional physical object.
It consists of roughly 86 billion neurons,
each of which has roughly 10,000 connections with other neurons. The action at each of those
synaptic junctures is mediated by upward of 100 neurochemicals. The number of states a system
can take depends on 1) the number of elements it has, 2) the number of states each element can
take, and 3) the dependencies among those elements. How many states can that system assume?
We don't really know. Jillions, maybe zillions, maybe jillions of zillions. A lot.
That is a state space of very high dimensionality. That state space is the mind. GPT-3 is a model
of that. Compared to a jillion zillion possible mental states, 175 billion parameters is peanuts.
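To get a feel for "jillions of zillions", here is a deliberately crude lower bound, my own toy arithmetic, which ignores synapses, neurochemistry, and all the dependencies that matter most:

    import math

    neurons = 86_000_000_000
    # Pretend each neuron is a bare, independent on/off switch.
    log10_states = neurons * math.log10(2)          # about 2.6e10
    print(f"roughly 10^{log10_states:.2e} possible states")
    print(175_000_000_000)                          # 175 billion parameters, for comparison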
7. Engineered intelligence at liberty in the world
It’s time to wrap things up. To do so I will quote some passages from a recent blog post by David
Ferrucci. Full disclosure: I know Ferrucci, though not well. I haven’t seen or talked with him in
decades but we do exchange emails every few years. Back in the early to mid-1980s I was on the
faculty at the Rensselaer Polytechnic Institute in the Department of Language, Literature, and
Communications. Ferrucci was, I believe, getting a master’s degree in Computer Science. We
worked with the late Geoff Goldbogen on a project to evaluate the uses of AI for manufacturing.
Ferrucci ended up working with IBM while also collaborating with Selmer Bringsjord (Cognitive
Science at RPI) on a story generator called BRUTUS. [38]
He then assembled the team at IBM that
created Watson, the computer system that beat humans at Jeopardy in February 2011. He went
on to found Elemental Cognition. [39]
Ferrucci says
In a blog post on July 30, 2020, [40] Ferrucci observes:
A language model is just a string probability guesser. Its superpower is to look at a string
of text (a word, a sentence, a paragraph) and guess how likely it is that a human
would write that string. To make these guesses, language models analyze mounds of text
in search of statistical patterns, such as what words tend to appear near what other
words, or how key terms repeat throughout a paragraph.
Yes. That’s all GPT-3 does and that’s all GPT-3 can do. Ferrucci goes on to observe that this is a
tremendously useful skill. However:
But ultimately, NLP aims higher. We want machines to understand what they read, and
to converse, answer questions, and act based on their understanding. So I have to
wonder: how much closer are we now than we were a decade ago, before neural
network language models swept the field?
As impressive as today’s NLP is, I worry that it’s still on a path that comes with severe
limitations. A system’s “understanding” can only go so far when its world consists
entirely of what writers typically say. The concepts we want machines to learn just
aren’t evident in the data we’re giving them.
Yes, language models can learn that humans often write “bowl” near “kitchen.” But
that’s the grand total of what a language model understands about bowls and kitchens.
Everything else that humans know about these objects (that bowls have raised edges, that bowls often break apart if you drop them, that people go to kitchens when hungry to find food) is taken for granted. All this context is obvious to us thanks to our shared
experiences, so writers don’t bother to lay it all out.
38
Selmer Bringsjord and David Ferrucci, Artificial Intelligence and Literary Creativity: Inside the Mind of Brutus, A
Storytelling Machine, Psychology Press, 1999.
39
https://www.elementalcognition.com/.
40
David Ferrucci, Can super-parrots ever achieve language understanding? Elemental Cognition website,
accessed August 4, 2020, https://www.elementalcognition.com/super-parrots-blog.
Ferrucci was trained in old school symbolic processing, an enterprise in which researchers devoted a couple of decades to developing machine-tractable mental models of various domains in the world (not texts, but the world). The objective was to produce artificial systems that understand language in some meaningful way. Understanding was grounded in those mental models. BRUTUS was built on such models. While Watson employed them as well, it also employed the newer, shallower, statistical models. [41] As far as I can tell from publicly available material, his approach at Elemental Cognition is eclectic as well, though the objective is more general than the question-answering that governed Watson's architecture and so will require a different architecture.
Interlude in a Chinese room
Yet if you would believe John Searle, no matter how rich and detailed those old school mental
models, understanding would necessarily elude them. I am referring, of course, to his (in)famous
Chinese Room argument. [42]
When I first encountered it years ago my reaction was something like:
interesting, but irrelevant. Why irrelevant? Because it said absolutely nothing about the techniques AI
or cognitive science investigators used and so would provide no guidance toward improving that
work. He did, however, have a point: If the machine has no contact with the world, how can it
possibly be said to understand anything at all? All it does is grind away on syntax.
What Searle misses, though, is the way in which meaning is a function of relations among
concepts, as I pointed out earlier (pp. 17 ff.). It seems to me, however (and here I'm just making this up) that we can think of meaning as having both an intentional aspect, the connection of signs to the world, and a relational aspect, the relations of signs among themselves. Searle's argument
concentrated on the former and said nothing about the latter.
What of the intentional aspect when a person is writing or talking about things not immediately
present, which is, after all, quite common? In this case the intentional aspect of meaning is not
supported by the immediate world. Language use thus must necessarily be driven entirely by the
relations signifiers have among themselves, Sydney Lamb’s point which we have already
investigated (p. 17).
In this respect, however, it is not obvious to me that there is any difference among a system such as GPT-3, which is utterly lacking in mental models, old school symbolic systems, which were built on them, and an eclectic system, such as Ferrucci proposes (as do Rodney Brooks and Gary Marcus of Robust AI). What then is the value of having mental models?
Gaining control
Let us recall an earlier formulation (from p. 10) where we noted that GPT-3 was defined over a huge corpus of language strings. Those strings were created by people making their way in the world and thus express both their intentionality, which is directed at the world, and their
relationality, which is inherent in their minds, mental models plus language (syntax, morphology,
etc.). Those strings reflect both mind and world. The text corpora supporting NLP engines (such as
GPT-3) thus intertwine both the intentional aspect of meaning and the relational.
41
David Ferrucci, Eric Brown, Jennifer Chu-Carroll, et al. Building Watson: An Overview of the DeepQA
Project, AI Magazine, Fall 2010, pp. 59-79.
42
I have written a number of blog posts about this argument. Here’s one of them: Another romp around
Searle’s Chinese room, New Savanna, blog post, July 18, 2018, http://new-
savanna.blogspot.com/2018/07/another-romp-around-searles-chinese-room.html. You can find others at
the Searle link, which, however, contains other Searle posts as well, http://new-
savanna.blogspot.com/search/label/Searle.
The function of providing a machine with a mental model is, in effect, to liberate it from that
entanglement. It is that entanglement that limits GPT-3 to guessing, to What’s next?
I know nothing of Ferrucci’s technical approach (more likely, approaches) to integrating symbolic,
or deep semantics (to use the language from the Watson paper) based on mental models, and
shallow semantics, based on statistical models. It is the mere fact of such integration that is
important. For the relational information that is implicit in GPT-3’s language model is opaque to
outsiders, and cannot be manipulated directly. All one can do is submit a prompt and have GPT-
3 continue on, word after word after predicted word.
As Ferrucci says, we need to do better than that.
And we can.
Language mirrors the world
Let us return to Neubig's "isometric transform" onto meaning space (pp. 18 ff.). I began
explicating it in terms of the relationship between strings of signifiers in a corpus and a high
dimensional organization of signifieds in mental space. But doesn’t that relationship, that
transform, ultimately exist between the world and mental space?
For, as I’ve said many times before, those strings arise through the interaction of mental space
and the world: people making their way in the world through writing. The mind mirrors the
world. No more, no less. We have arrived back at the metaphysical structure of the world (pp. 23
ff.).
How could it be otherwise? The human mind is the product of millions of years of evolution, of
animals making their way in the world. Our sensorium is adapted to perceiving the world and our
motor system is adapted to moving about in the world. Perception and action are interrelated by
cognition and thought, the mind.
Natural language AIs cannot compete with the human sensorimotor system no matter how much
text they train on. Too much information is missing from the text. No doubt some of the gap can
be made up by variously hand-crafted augmentations and by having humans constantly interact
in partnership, as Ferrucci and his team are doing at Elemental Cognition. [43]
Then we have robots, which I haven’t discussed at all. Robots do perceive and move about in the
world. And, while we can equip robots with both perceptual and motor powers that we do not
have, those powers operate in very restricted domains. We do not know how to endow robots
with our sensorimotor capabilities. Our robots must necessarily remain strangers in the land.
The natural domain for an AI would be the digital world, that is the world in which an artificial
intelligence is a native. How do we endow an AI with the capacity to learn about and operate in a
purely digital world, and to what end?
With that question my line of thought in this working paper comes to an end. That is something I
intend to take up in a later working paper.
43
Elemental Cognition, “Continuous Human-Machine Collaboration”, accessed August 5, 2020,
https://www.elementalcognition.com/technology. See also his talk at the Allen Institute for AI in 2014,
https://youtu.be/F_0hpnLdNjk.
What’s next?
This working paper is at an end. As I indicated at the very beginning, this effort began with a long
comment I posted at Tyler Cowen’s blog, Marginal Revolution. This working paper has covered
the first two paragraphs in that comment. Here is the rest of that comment:
Think AI as platform, not feature (Andreessen). [44]
Obvious implication, the basic
computer will be an AI-as-platform. Every human will get their own as a very young
child. They'll grow with it; it'll grow with them. The child will care for it as with a pet. Hence we have ethical obligations to them. As the child grows, so does the pet; the pet will likely have to migrate to other physical platforms from time to time.
Machine learning was the key breakthrough. Rodney Brooks' Genghis, with its subsumption architecture, was a key development as well, for it was directed at robots
moving about in the world. FWIW Brooks has teamed up with Gary Marcus and they
think we need to add some old school symbolic computing into the mix. I think they’re
right.
Machines, however, have a hard time learning the natural world as humans do. We're
born primed to deal with that world with millions of years of evolutionary history
behind us. Machines, alas, are a blank slate.
The native environment for computers is, of course, the computational environment.
That's where to apply machine learning. Note that writing code is one of GPT-3's skills.
So, the AGI of the future, let's call it GPT-42, will be looking in two directions, toward
the world of computers and toward the human world. It will be learning in both, but in
different styles and to different ends. In its interaction with other artificial computational
entities GPT-42 is in its native milieu. In its interaction with us, well, we'll necessarily be
in the driver's seat.
Where are we with respect to the hockey stick growth curve? For the last three-quarters of a century, since the end of WWII, we've been moving horizontally, along a plateau,
developing tech. GPT-3 is one signal that we've reached the toe of the next curve. But
to move up the curve, as I've said, we have to rethink the whole shebang.
We're IN the Singularity. Here be dragons.
[Superintelligent computers emerging out of the FOOM is bullshit.]
When I posted the first version of this working paper in August of 2020 I had intended to cover
that material in another series of posts which I would then consolidate into a working paper
tentatively entitled, After GPT-X: The Star Trek computer, and beyond. I started down
that path, wrote a number of posts, planned another working paper or three, but never made it to
the end. Still, that seems like a worthy objective but I’ve just taken another path. So...
To the Star Trek computer, and beyond.
44
Is AI a feature or a platform? [machine learning, artificial neural nets], New Savanna, blog post,
December 13, 2019, https://new-savanna.blogspot.com/2019/12/is-ai-feature-or-platfrom-machine.html.
Appendix: Semanticity, adhesion and relationality
Let’s review a passage where I discuss Searle’s Chinese Room thought-experiment (p. 31):
Yet if you would believe John Searle, no matter how rich and detailed those old school
mental models, understanding would necessarily elude them. I am referring, of course,
to his (in)famous Chinese Room argument. When I first encountered it years ago my
reaction was something like: interesting, but irrelevant. Why irrelevant? Because it said
absolutely nothing about the techniques AI or cognitive science investigators used and
so would provide no guidance toward improving that work. He did, however, have a
point: If the machine has no contact with the world, how can it possibly be said to
understand anything at all? All it does is grind away on syntax.
What Searle misses, though, is the way in which meaning is a function of relations
among concepts, as I pointed out earlier (pp. 17 ff.). It seems to me, however (and here I'm just making this up) that we can think of meaning as having both an intentional aspect, the
connection of signs to the world, and a relational aspect, the relations of signs among
themselves. Searle’s argument concentrated on the former and said nothing about the
latter.
What of the intentional aspect when a person is writing or talking about things not
immediately present, which is, after all, quite common? In this case the intentional
aspect of meaning is not supported by the immediate world. Language use thus must
necessarily be driven entirely by the relations signifiers have among themselves, Sydney
Lamb’s point which we have already investigated (p. 17).
Those statistics are grabbing onto the relational aspect of meaning. The question is: How much of
that can these methods recover from texts? Let’s set that aside for the moment.
Intention, relationality, and adhesion
That passage mentions intention and relation. Intention resides in the relationship between a person
and the world. Relation resides in the relationships that signifiers have among themselves. It is a
property of the cognitive system. I am now thinking that it must be paired with adhesion. Taken
together they constitute semanticity. Thus we have semanticity and intention where semanticity is a
general capacity inherent in the cognitive system, in a person’s mind, and intention inheres in the
relation between a person and the world in a particular perceptual and/or cognitive activity.
What do I mean by adhesion? Adhesion is how words ‘cling’ to the world while relationality is the
differential interaction of words among themselves within the linguistic system. Words whose meaning is defined directly over the physical world, and also, to some extent, over the interpersonal world of signals and feelings, adhere to the world through sensorimotor schemas. Words whose meaning is abstract are more problematic. Their adhesion operates through patterns of
words and other signs and symbols (e.g. mathematics, data visualizations, illustrative diagrams of
various kinds, and so forth). Teasing out these systems of adhesion has just barely begun.
The psychologist J.J. Gibson talked of the affordances an environment presents to the organism.
Affordances are the features of the world which an organism can readily pick up during its life in
the world. Adhesions are the organism’s complement to environmental affordances; they are the
perceptual devices through which the organism relates to the affordances.
What this means for language models
Large language models built through deep neural networks, such as GPT-3, conflate the
interaction of three phenomena: 1) the word-level relational aspect of semanticity as captured in
the locations of word forms (signifiers) in a string, 2) the conventions of discourse structure, and 3)
the world itself. The world is present in the model because the texts over which the model was
constructed were created by people interacting in the world. They were in an intentional
relationship with the world when they wrote those texts. The conventions of discourse are present
simply because they organize the placement of word forms in a text, with special emphasis on the
long-distance relationships of words. As for relationality, that's all that can possibly be present in a
text. Adhesions belong to the realm of signifieds, of concepts and ideas, and they aren’t in the text
itself.
Would it somehow be possible to factor a language model into these three aspects? I have no idea.
The point of doing so would be to reduce the overall size of the model.
Putting that aside, let us ask: Given a sufficiently large database of texts and tokens and a high
enough number of parameters for our model, is it possible for a language model to extract all the
relationality from the texts? How much of that multidimensional relational semanticity can be
recovered from strings of word forms? Given a deep enough understanding of how relational
semantics is reflected in the structure of texts, can we calculate what is possible with various text
bases and model parameterization?
To answer those questions we need to have some account of semantic relationality which we can
examine. The models of Old School symbolic AI and computational linguistics provide such
accounts. Many such models have been created. Which ones would we choose as the basis for our
analysis? The sort of question that interests me is how many word forms have their meanings
given in adhesions to the physical world (that is, physical objects and events), to the interpersonal
world (facial expressions, gestures, etc.) and how many word forms are defined abstractly?
So many questions.