Complexity of Symbolic Representation in Working
Memory of Transformer Correlates with the Complexity
of a Task
Alsu Sagirova (a,*), Mikhail Burtsev (a,b)

(a) Neural Networks and Deep Learning Lab,
Moscow Institute of Physics and Technology,
Institutskiy pereulok, 9, Dolgoprudny, 141701, Russia
(b) AIRI,
4B 08, Kutuzovsky prospect 32 build. 1, Moscow, 121170, Russia

(*) Corresponding author. Email addresses: alsu.sagirova@phystech.edu (Alsu Sagirova), burtcev.ms@mipt.ru (Mikhail Burtsev)
Abstract
Even though Transformers are extensively used for Natural Language Process-
ing tasks, especially for machine translation, they lack an explicit memory to
store key concepts of processed texts. This paper explores the properties of the
content of symbolic working memory added to the Transformer model decoder.
Such working memory enhances the quality of model predictions in the machine
translation task and works as a neural-symbolic representation of information
that is important for the model to make correct translations. The study of
memory content revealed that translated text keywords are stored in the work-
ing memory, pointing to the relevance of memory content to the processed text.
Also, the diversity of tokens and parts of speech stored in memory correlates
with the complexity of the corpora for the machine translation task.
Keywords: Neuro-Symbolic Representation, Transformer, Working Memory,
Machine Translation
1. Introduction
Working memory is a theoretical concept from neuroscience that repre-
sents a cognitive system that temporarily stores information and manipulates it
(Miyake & Shah, 1999). While neural network models successfully tackle vari-
ous tasks in different fields of artificial intelligence and are robust to the data
flaws, they still lack interpretability and generalization. Symbolic methods re-
fer to explicit symbol data representation and processing. Therefore, symbolic
methods are explainable and easy to check for correctness. Combining the neural
network approach with symbolic components should empower neural model
decision-making procedures and enhance the overall model performance.
Today, neuro-symbolic architectures provide a foundation for the area of Natural
Language Processing (NLP). Here, the dominating approach is to train a
model to encode a symbolic input into internal vector representations accumu-
lated in some memory and then decode the content of this memory back to
a sequence of symbols. The majority of NLP models have no explicit mem-
ory storage for the information missing in the input sequence. But for better
processing of a text, additional contextual knowledge should be helpful. We
hypothesize that operating with such knowledge associated with an input but
not explicitly presented in it should help the model to understand the processed
text conceptually and improve the quality of predictions. We propose to add
symbolic working memory to the Transformer decoder to generate and store
contextual knowledge in a textual form to simulate some sort of "inner speech."
2. Related work
The use of dedicated memory storage in neural network models to store and
retrieve explicit or implicit vector representations required for computations is
studied in the Memory-augmented neural networks (MANN) research area. The
simplest examples of memory-augmented neural network models are the Recurrent
Neural Network (RNN) and Long Short-Term Memory (LSTM) (Hochreiter &
Schmidhuber, 1997), where the internal memory is represented by the model’s
hidden states, which summarize the input history and are controlled by the
gates.
External memory was implemented in the Neural Turing Machine (NTM)
(Graves et al., 2014) and its successor, the Differentiable Neural Computer
(DNC) (Graves et al., 2016). The NTM has a memory matrix of a fixed size con-
trolled by the neural network to store real-valued vector representations. The
controller network receives inputs and produces outputs, and it also manipulates
memory with parallel read and write heads.
The DNC is an end-to-end differentiable extension of an NTM that uses the
random-access memory concept. The controller network employs three attention
mechanisms in read and write heads to interact with memory. The first mech-
anism is a content lookup, where the attention weights are based on a cosine
similarity between the controller-generated key vector and each memory slot.
The second mechanism is writing via dynamic memory allocation when the con-
troller can free memory slots that are no longer required. The third mechanism
is a temporal memory linkage used for sequential reading from memory slots.
In the Sparse DNC model (Rae et al., 2016), the authors test randomized k-d
trees and locality-sensitive hashing (LSH) algorithms to make memory addressing
sparse.
Memory Networks (Weston et al., 2015) and End-to-End Memory Networks
(Sukhbaatar et al., 2015) employ a recurrent attention mechanism for reading
memory to solve a question answering (QA) task. The input sequence embed-
ding is stored in the memory and then matched with the query embedding to
obtain attention scores. These scores are then applied to another representation
of the input sequence to give a response vector. The model handles multi-hop
memory updates by combining the previous layer input representation and the
response vector into the current layer input representation. Hierarchical Mem-
ory Networks (HMNs) (Chandar et al., 2016) organize memory cells into a hi-
erarchical structure to ease computation compared with the soft attention over
flat memory. Also, unlike Memory Networks, which use soft attention entirely,
HMNs apply soft attention for a subset of memory slots selected by a mechanism
based on Maximum Inner Product Search (MIPS).
Dynamic NTMs (D-NTMs) (Gulcehre et al., 2017b) extend an NTM model
with a trainable scheme for memory addressing to allow various soft and hard
attention mechanisms to read from memory. Reading is done in a multi-hop
manner compared with the multi-head reading in NTMs. Also, feedforward and
Gated Recurrent Unit (GRU) controller networks are tested.
The TARDIS (Gulcehre et al., 2017a) memory structure is similar to NTMs
and D-NTMs. For more efficient gradient propagation, memory is considered as
storage for wormhole connections. Read and write operations are implemented
with discrete addressing, and once memory is full, a heuristic is used for memory
read and write operations.
The Global Context Layer (GCL) (Meng & Rumshisky, 2018) incorporates
the global context information into memory. The reading mechanism has sep-
arated address and content parts to ease the training. Compared to the NTM,
GCL does not interpolate the attention vector at the current time step with the
one from the previous time step but ignores the previous one entirely.
Transformer (Vaswani et al., 2017) is an encoder-decoder neural network
model, which successfully solves various natural language processing tasks (Raf-
fel et al., 2020). The model is based on the attention mechanism that uses
the information about the entire processed sequence to predict the next token.
The standard Transformer architecture manipulates only the representations
of the elements of the input sequence. There are also Transformer-based neu-
ral network architectures that incorporate external memory. The Extended
Transformer Construction (ETC) (Ainslie et al., 2020) employs a global-local
attention mechanism to handle long inputs. The model input is separated into
two parts: long input, which is a standard Transformer input sequence, and
a small set of auxiliary tokens called global input. Hidden representations of
the global input tokens store summarized information about sets of long input
tokens. Each part of the input is associated with its type of attention: full self-
attention between global input tokens, full cross-attention between global and
long inputs, and self-attention restricted to a fixed radius for long input tokens.
BigBird (Zaheer et al., 2020) and Longformer (Beltagy et al., 2020) sparsify
attention from a quadratic to a linear dependence on the sequence length by
reserving a number of pre-selected input tokens to store global representations.
Global tokens are allowed to attend to the entire sequence and, as a result,
accumulate and redistribute global information. In addition, BigBird can use
extra tokens to preserve contextual information.
An extension of the input sequence with extra memory-dedicated tokens
is implemented in MemTransformer, MemCtrl Transformer, and MemBottle-
neck Transformer (Burtsev et al., 2021). In MemTransformer, specially re-
served memory tokens are concatenated with the encoder input sequence to
form the Transformer input. The model uses full self-attention over the memory-
augmented input sequence and processes inputs in a standard way. MemCtrl
Transformer uses the same memory-augmented input sequence as MemTrans-
former and has a sub-network to control memory and original input sequence
tokens separately. In MemBottleneck Transformer, full attention is allowed be-
tween the input and the memory only. So, to update the representations, the
model first updates the memory, as in MemCtrl Transformer, and then updates
the sequence representation.
In our symbolic working memory model studied in this paper, input en-
coding follows the vanilla Transformer, but during the generation of an output
sequence, a decoder decides whether to write the next token to the internal
working memory or the output target prediction. Memory elements are repre-
sented with the same embeddings as for the tokens from the vocabulary to retain
memory interpretability. Working memory tokens are processed as any other
elements of the decoder input sequence, which allows the decoder to attend to
both target and memory tokens.
Working memory elements are tokens from the vocabulary, representing
natural language words or subwords, which makes memory content more ex-
plainable. We can juxtapose the working memory content with golden target
sentences and model target predictions and search for insights about the model
decision-making process. We expect that after training, working memory should
exhibit some properties that make it helpful in solving the target task. First,
the working memory content should be related to the content of the target
task. Second, the complexity of the memory content should correlate with the
complexity of the task. In this paper, we study these properties of the work-
ing memory for the Russian-to-English machine translation task on corpora of
various lexical and grammatical complexity.
We summarize the differences between the related work and the proposed
method in Table 1.
This work studies symbolic working memory in the Transformer decoder as
a representation of pieces of information chosen by the neural model to make
predictions and examines how this representation is related to the target task
and how working memory improves the model performance. We believe that the
addition of working memory into the Transformer decoder is the first attempt
to store interpretable symbolic representations in external memory for a neural
network model.
3. Transformer with Working Memory in Decoder
Transformer is a sequence-to-sequence encoder-decoder model. In a ma-
chine translation task, during inference, the Transformer input consists of two
sequences of tokens: a sentence in the source language and its partial transla-
tion to the target language. The encoder part of the model processes the first
sequence. Then, the Transformer decoder takes hidden representations from
the encoder output and the piece of already generated translation to predict the
next token.

Architecture: NTM (Graves et al., 2014), GCL (Meng & Rumshisky, 2018)
Memory form: External memory is presented with an uninterpretable memory matrix of fixed size.
Memory access: NTM updates memory using the copy of memory from the previous time step. GCL does not use the previous memory states in the memory update procedure. Reading from the memory is processed by the recurrent network. GCL, unlike NTM, uses separated content and address components.

Architecture: DNC (Graves et al., 2016), Sparse DNC (Rae et al., 2016)
Memory form: Memory is a matrix with dynamic allocation.
Memory access: In DNC, writing to the memory and reading from it is done by differentiable attention mechanisms. In Sparse DNC, read and write operations are constrained to combine a constant number of non-zero memory entries.

Architecture: Memory Networks (Weston et al., 2015), End-to-End Memory Networks (Sukhbaatar et al., 2015), HMN (Chandar et al., 2016)
Memory form: Memory is an array storing vector representations of the input sequence.
Memory access: Reading from memory in Memory Networks and End-to-End Memory Networks is done with recurrent attention. In HMN, memory is hierarchically structured to minimize computation when reading from memory.

Architecture: D-NTM (Gulcehre et al., 2017b)
Memory form: Each memory matrix cell has content and trainable address vectors.
Memory access: Memory addressing is location-based. Memory processing is done with an NTM-like controller network.

Architecture: TARDIS (Gulcehre et al., 2017a)
Memory form: Memory matrix of a fixed size is controlled by an RNN similarly to NTMs and D-NTMs.
Memory access: TARDIS uses discrete addressing when operating with memory. Writing information to the memory is done in sequential order analogously to NTMs. When the memory is filled up, the access is based on tying the model write and read heads.

Architecture: ETC (Ainslie et al., 2020), BigBird (Zaheer et al., 2020), Longformer (Beltagy et al., 2020)
Memory form: Selected tokens from the encoder input sequence.
Memory access: Writing to memory and reading from it is done with specific attention patterns.

Architecture: MemTransformer, MemCtrl Transformer, MemBottleneck Transformer (Burtsev et al., 2021)
Memory form: A fixed number of special tokens is prepended to the encoder input sequence.
Memory access: MemTransformer processes memory tokens as standard input tokens. MemCtrl Transformer reads from memory with the standard self-attention. Memory updates in MemCtrl Transformer and MemBottleneck Transformer are done with a special layer. To read from memory, the MemBottleneck Transformer input sequence attends only to memory.

Architecture: Transformer with working memory in decoder (ours)
Memory form: A fixed number of tokens from the vocabulary is mixed with the target input.
Memory access: The model-generated memory tokens are written to the decoder input sequence in positions corresponding to their creation time steps. The memory reading mechanism is standard Transformer multi-head self-attention.

Table 1: Comparison of the related MANN works and the Transformer with working memory.
The standard Transformer decoder is stacked from N identical layers. To
process the i-th decoder layer, firstly, the normalized sum of target inputs Y_inp
and their masked multi-head attention scores MHA(Q, K, V, mask) is calculated:

A_self,i = LN(Y_inp + MHA(Y_inp, Y_inp, Y_inp, look_ahead_mask)).    (1)
Then the multi-head cross-attention between the sequence representation
A_self,i and the encoder output E followed by normalization is done:

A_cross,i = LN(A_self,i + MHA(A_self,i, E, E)).    (2)
The aggregated representation A_cross,i is updated with a position-wise feed-forward
network FFN(X), then a skip connection and normalization are used:

D_out,i = LN(A_cross,i + FFN(A_cross,i)).    (3)
To obtain logits, the N-th decoder layer outputs are sent to the final dense
layer:

Y_pred = Linear(D_out,N).    (4)
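For illustration, a minimal PyTorch sketch of one such decoder layer (Eqs. (1)-(3)) could look as follows; the class and hyperparameter names are ours and are not taken from the authors' implementation.

import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    # Sketch of Eqs. (1)-(3): masked self-attention, cross-attention to the
    # encoder output E, and a position-wise feed-forward block, each followed
    # by a residual connection and layer normalization.
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ln3 = nn.LayerNorm(d_model)

    def forward(self, y_inp, enc_out):
        t = y_inp.size(1)
        # Look-ahead mask: position i may only attend to positions <= i.
        look_ahead = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                           device=y_inp.device), diagonal=1)
        a_self, _ = self.self_attn(y_inp, y_inp, y_inp, attn_mask=look_ahead)
        a_self = self.ln1(y_inp + a_self)                       # Eq. (1)
        a_cross, _ = self.cross_attn(a_self, enc_out, enc_out)
        a_cross = self.ln2(a_self + a_cross)                    # Eq. (2)
        return self.ln3(a_cross + self.ffn(a_cross))            # Eq. (3)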
In the working memory implementation (Sagirova & Burtsev, 2022), memory is
represented by M additional tokens in the decoder input. The Transformer decoder
generates, stores, and retrieves M working memory tokens in the same way it
predicts the translation sequence. Memory tokens are placed in the decoder
input sequence and processed by the model in the same way as standard Transformer
decoder input tokens, so while decoding the sequence, the model has full access
to the memory tokens generated so far.
To treat working memory tokens as a part of the Transformer decoder input
sequence, we allow the positions of the memory tokens to be mixed with the
positions of target predictions in the generated sequence. For every predicted
token, the model also predicts whether the token will be stored in working
memory or in the target sequence.
The architecture is depicted in Fig. 1. The model predictions are generated
sequentially, one token at a time. When a newly generated token appears, the
model decides if it is a memory token or the resulting translation token. Thus,
the Transformer decoder input contains target sequence predictions alternating
with memory tokens.
To allow the model to predict the token type and mark it with a dedicated
flag value, we extend the dimensionality of the Transformer final layer up to
target_vocabulary_size + 2. Two additional units are used to predict the token
type flag values. The embedding of this flag is added to the corresponding
decoder input token embedding at the next decoding step to allow the model
to differentiate the memory content from the target prediction values.
Figure 1: Transformer with the working memory-augmented decoder. The decoder inputs
are the tokens generated so far, y_1, ..., y_{t-1}, and the corresponding memory flags
m_1, ..., m_{t-1}. The memory flag is a binary value: m_i = 1 means that y_i is a target
prediction token, and m_j = 0 means y_j is a working memory token. The final layer of the
model has an expanded output size = target_vocabulary_size + 2. The loss function takes
into account the difference between the target sequence predictions and the real targets rather
than the memory tokens.
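A hedged sketch of how the expanded output head and the token-type flag embedding could be wired in PyTorch is shown below; the paper specifies only the output size of target_vocabulary_size + 2, so the module layout and names are our assumptions.

import torch
import torch.nn as nn

class OutputHeadWithFlag(nn.Module):
    # Final projection of size vocab_size + 2: the first vocab_size units give
    # the token logits, the last two units give the memory/target flag logits.
    def __init__(self, d_model=512, vocab_size=32000):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size + 2)
        self.flag_emb = nn.Embedding(2, d_model)  # 0 = memory token, 1 = target token
        self.vocab_size = vocab_size

    def split_logits(self, decoder_out):
        logits = self.proj(decoder_out)
        return logits[..., :self.vocab_size], logits[..., self.vocab_size:]

    def embed_with_flag(self, token_emb, flags):
        # The flag embedding is added to the token embedding fed back to the
        # decoder at the next step, so the model can tell memory tokens from
        # target tokens in its own input.
        return token_emb + self.flag_emb(flags)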
The procedure of training the Transformer with working memory in the decoder is
described in Algorithm 1. The model takes as input the source sequence X_inp,
the decoder input containing the start-of-sequence token Y_inp = (Y_1), the
start-of-sequence token type T_inp = (T_1), the ground truth target sequence Y_real,
and the working memory size mem_size. The encoder transforms the source sequence
X_inp into the representation E. Then, given E, Y_inp, T_inp, and Y_real, the decoder
generates the next token Y_pred and its type T_pred. According to the value of T_pred
and the number of memory tokens generated so far, the currently generated token is
concatenated to Y_inp with teacher forcing, or as is if the token is flagged as memory.
The predicted token type value is appended to T_inp. The sequence is generated
token by token until the memory is full and the target prediction sequence matches
the length of the real target.
For example, if the decoder input sequence is the following:

Y = [Y_1^tar, Y_2^tar, Y_1^mem, Y_3^tar, Y_2^mem, Y_3^mem, Y_4^tar],    (5)

where Y_i^tar are the target prediction tokens and Y_j^mem are the tokens stored in
the working memory, then the token type sequence for Y will look as follows:

T = [1, 1, 0, 1, 0, 0, 1].    (6)
The token type vector helps locate the working memory elements in the pre-
dicted sequence.
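As a minimal illustration (with placeholder token strings), the token type vector can be used to separate the two streams like this:

def split_by_type(tokens, types):
    # types[i] == 1 -> target prediction token, types[i] == 0 -> memory token
    target = [tok for tok, t in zip(tokens, types) if t == 1]
    memory = [tok for tok, t in zip(tokens, types) if t == 0]
    return target, memory

# Example matching Eqs. (5)-(6):
tokens = ["Y1_tar", "Y2_tar", "Y1_mem", "Y3_tar", "Y2_mem", "Y3_mem", "Y4_tar"]
types = [1, 1, 0, 1, 0, 0, 1]
target, memory = split_by_type(tokens, types)  # 4 target tokens, 3 memory tokens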
Algorithm 1 Forward pass of Transformer with working memory in decoder
Require: X_inp, Y_inp = (Y_1), T_inp = (T_1), Y_real, mem_size
  E = Encoder(X_inp)
  i = 0, mem_num = 0
  while len(Y_inp) < len(Y_real) + mem_size do
      (Y_pred, T_pred) = Decoder(E, Y_inp, T_inp)
      if T_pred == 0 then                        ▷ Y_pred token will be stored in memory
          if mem_num < mem_size then
              Y_inp = concat(Y_inp, Y_pred)
              mem_num = mem_num + 1
          else
              T_pred = 1
          end if
      end if
      if T_pred == 1 then                        ▷ teacher forcing target prediction token
          Y_inp = concat(Y_inp, Y_real[i + 1])
          i = i + 1
      end if
      T_inp = concat(T_inp, T_pred)
  end while
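A plain-Python sketch of the same forward pass is given below; encoder and decoder are placeholders standing in for the trained model components, and indexing details may differ from the authors' code.

def forward_pass(encoder, decoder, x_inp, y_real, mem_size, sos_token, sos_type=1):
    # Training-time generation loop of Algorithm 1: memory tokens are appended
    # to the decoder input as predicted, target tokens are teacher-forced.
    e = encoder(x_inp)
    y_inp, t_inp = [sos_token], [sos_type]
    i, mem_num = 0, 0
    while len(y_inp) < len(y_real) + mem_size:
        y_pred, t_pred = decoder(e, y_inp, t_inp)   # next token and its type
        if t_pred == 0:                             # predicted as a memory token
            if mem_num < mem_size:
                y_inp.append(y_pred)                # written to memory as is
                mem_num += 1
            else:
                t_pred = 1                          # memory is full: treat as target
        if t_pred == 1:                             # target token: teacher forcing
            y_inp.append(y_real[i])
            i += 1
        t_inp.append(t_pred)
    return y_inp, t_inp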
At inference, there is no teacher forcing, and every token and its type
value generated by the model are stored in the decoder input sequence as is,
with the corresponding flag values.
In all experiments reported in this paper, the memory size M is set to 10. To
calculate the loss function during training, we exclude the predicted sequence
elements that belong to the working memory. We use different decoding strategies
for the target prediction tokens and the working memory content. To decode
target predictions, we use best path decoding, and to obtain memory tokens,
we apply nucleus sampling (Holtzman et al., 2019) with a sampling parameter
p_nucleus = 0.9.
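Nucleus sampling for the memory tokens can be implemented generically as follows (this is a standard top-p sampler, not the authors' code); best path decoding for target tokens is simply an argmax over the token logits.

import torch

def nucleus_sample(logits, p=0.9):
    # Sample a token id from the smallest prefix of the sorted distribution
    # whose cumulative probability exceeds p (Holtzman et al., 2019).
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p))) + 1  # keep >= 1 token
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()         # renormalize
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice])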
4. Datasets for Machine Translation Task
This work aims to study the working memory content, find relations between
model predictions and tokens stored in the memory, and explore how the com-
plexity of an input text affects the working memory. For experiments, we used
four datasets collected from different natural language domains.
The first is the TED Ru-En machine translation dataset¹ from the TED
Talks Open Translation Project (Ye et al., 2018). The TED dataset is a collection
of transcripts of TED Talks, which are well-prepared speeches for a wide
audience, so the sentences should be unambiguous, easy to understand, and
grammatically correct at the same time.

¹ https://www.tensorflow.org/datasets/catalog/ted_hrlr_translate
² https://russiansuperglue.com/tasks/task_info/RWSD
³ https://cs.nyu.edu/~davise/papers/WinogradSchemas/WS.html

The second dataset consists of paired sentences from the Russian Winograd
Schema Challenge² (RWSD) and the original English Winograd Schema
Challenge (WSC) dataset³ (Levesque et al., 2012). The Winograd schemas represent
pairs of sentences that differ in only one or two words and contain an ambiguity
that is resolved in opposite ways in two sentences and requires the use of world
knowledge and reasoning for its resolution. Such ambiguity is also a challenge
from the machine translation point of view because the translation system aims
to keep the meaning of the sentence and make correct word choices to result
in an accurate translation. Combining Russian and English versions of the
Winograd schemas was possible because samples in Russian were collected by
manually translating and adapting the original Winograd dataset for Russian.
The translations were also human-assessed.
The other two datasets are from the OPUS project (Tiedemann, 2012) and
are sourced from TensorFlow Datasets⁴. Open Subtitles is a collection of
translated movie subtitles⁵. This dataset represents a collection of pairs of spoken
language phrases and lines from movies. Such informal language includes
colloquialisms, phrasal verbs, and contractions, which represent another level of
translation complexity compared with written language, as in WSC, or prepared
speeches, as in TED.
IT documents is a collection of parallel corpora of localization files for
GNOME, KDE4, and Ubuntu and documentation files for PHP and OpenOffice.
The IT documents dataset consists of sentences written in technical language
and contains field-specific terms and abbreviations. These dataset features make
the task of machine translation of IT documents challenging.
According to the described features of the data, in our further analysis,
we consider the TED dataset as the least complex for a machine translation
task, then Open Subtitles, WSC, and IT documents in the ascending order of
translation difficulty.
The model pre-trained on TED was fine-tuned on WSC, Open Subtitles,
and IT Documents for 30 epochs to test how memory content changes after
domain adaptation. We inferred translations for sentences from the TED test
set (5476 samples) and the joined Winograd validation and test sets (95 samples)
for the memory content study. Only train sets are available for Open
Subtitles and IT documents, so from each dataset we randomly selected 6000
samples that did not appear during fine-tuning. The inference sets’ sizes and
sentence lengths are presented in Table 2.
Winograd English sentences are the longest on average, and the Winograd
data has fewer samples than the other three datasets. To equalize
the range of possible translation lengths, we cut off the sentences from TED,
Open Subtitles, and IT documents whose lengths fall outside the bounds of the WSC
predicted and reference translation lengths. We also drop duplicate samples
from the inference sets. As a result, we analyze 4665 and 4875 samples from the
IT dataset, 3874 and 4506 Open Subtitles samples, 3477 and 3486 TED samples,
and 95 and 95 WSC samples before and after fine-tuning, respectively.
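One plausible reading of this filtering step is sketched below; the sample representation and function name are illustrative and not taken from the paper's code.

def filter_by_wsc_bounds(samples, wsc_min_len, wsc_max_len):
    # samples: list of (predicted_tokens, reference_tokens) pairs.
    # Keep only samples whose predicted and reference translation lengths fall
    # inside the WSC length bounds, and drop duplicates.
    seen, kept = set(), []
    for pred_tokens, ref_tokens in samples:
        if not (wsc_min_len <= len(pred_tokens) <= wsc_max_len):
            continue
        if not (wsc_min_len <= len(ref_tokens) <= wsc_max_len):
            continue
        key = (tuple(pred_tokens), tuple(ref_tokens))
        if key in seen:
            continue
        seen.add(key)
        kept.append((pred_tokens, ref_tokens))
    return kept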
The TED talks dataset used for the initial training is lowercase, and the
samples from the IT documents, Open Subtitles, and Winograd Schema Challenge
datasets have the first letter of the first word of a sentence and proper nouns
capitalized.

⁴ https://www.tensorflow.org/datasets/catalog/opus
⁵ https://www.opensubtitles.org

Value            IT En               OpSub En            TED En              WSC En
                 before after refs   before after refs   before after refs   before after refs
Samples          6000                6000                5476                95
Min length       2      2     12     2      4     12     2      2     2      14     14    16
Max length       139    139   135    109    119   106    164    58    160    42     72    123
Average length   24     23    31     17     19    18     20     20    23     30     34    45

Table 2: Datasets for the working memory content study. For each dataset, we provide the
size of the inference set and the minimal, maximal, and average sample length in tokens for the
predicted translations before fine-tuning (column "before"), after fine-tuning (column "after"),
and reference translations from the data (column "refs").
5. Study of Memory Content
The baseline model we study is Transformer with the working memory in the
decoder that was pre-trained on the train set from the TED Ru-En dataset. For
pre-training, we used a standard Transformer for the first five epochs and then
added working memory to the decoder and continued pre-training for 15 epochs.
We calculated BLEU 4 (Papineni et al., 2002) and METEOR (Lavie & Agarwal,
2007) scores averaged for three runs on the TED Ru-En validation set to evaluate
the model. The resulting model had BLEU = 21.30 and METEOR = 48.81.
This translation quality is slightly better than that demonstrated by the standard
Transformer model after 20 epochs (BLEU = 21.16 and METEOR = 40.93). The
scores for all datasets before and after fine-tuning for the standard Transformer
and the Transformer trained with working memory are presented in Table 3.
Model                    IT documents      Open Subtitles    TED               WSC
                         BLEU    METEOR    BLEU    METEOR    BLEU    METEOR    BLEU    METEOR
Standard, pre-trained    3.48    14.94     8.10    22.18     21.16   40.93     6.00    25.93
Standard, fine-tuned     7.85*   17.20     11.71*  22.37     22.10   41.89     6.76    23.54
WM, pre-trained          3.42    15.26     8.20    22.27     21.30   48.81     6.58    27.78*
WM, fine-tuned           7.83    17.25*    11.55   22.58*    22.18*  49.38*    7.50*   27.09
WM, pre-trained,
  masked mem             -       -         -       -         20.56   47.75     -       -
Table 3: Performance of the models with and without working memory in the decoder after
pre-training and after fine-tuning for all datasets. The first two rows show the scores for the
standard Transformer model. The third and the fourth rows correspond to the quality of
predictions of the Transformer with working memory in the decoder. The last row shows the
scores on TED for the model pre-trained with working memory for which the attention on
memory tokens was disabled during inference. The best BLEU and METEOR scores for each
dataset are marked with an asterisk.
We also checked if the model trained with working memory uses it to im-
prove the translation quality. We disabled attention on the working memory
slots, so the memory was generated during inference, but the target sequence
could not attend to it. The resulting metrics on the TED validation set were
BLEU = 20.56 and METEOR = 47.75, which are lower by 0.74 BLEU and
1.06 METEOR points than the scores of the Transformer with working memory
and the standard attention mechanism. This experiment shows the importance of
working memory in the prediction process for better model performance.
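This ablation can be imitated by masking attention from target positions to memory positions; the sketch below assumes per-position token-type flags are available and uses a boolean mask where True marks blocked positions, as in PyTorch attention modules.

import torch

def mask_memory_attention(token_types):
    # token_types: 1-D tensor of 0/1 flags over the decoder input sequence
    # (1 = target prediction token, 0 = working memory token).
    t = token_types.size(0)
    look_ahead = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    is_target_row = (token_types == 1).unsqueeze(1)   # queries that are target tokens
    is_memory_col = (token_types == 0).unsqueeze(0)   # keys that are memory tokens
    blocked = is_target_row & is_memory_col           # targets must not see memory
    return look_ahead | blocked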
Working memory in our experiments has a fixed size of 10 tokens, but the
content of memory varies: it can be filled with a single repeating token or
have several groups of identical tokens, or contain ten different tokens. Table 4
shows the examples of sequences predicted by the model and the corresponding
reference translations.
Data: IT
Source sentence (Ru): установить геометрию главного окна.
Model prediction (En): [start](scu)(install)(we )(because )(add )(fi)(and )(so )(we)(set )we set up a geometry of the main window.[end]
Reference translation (En): sets the client geometry of the main widget.

Data: IT
Source sentence (Ru): Временная коллекция содержит файлы, которые вы назначили на воспроизведение, но не хотите добавлять к какой-либо коллекции.
Model prediction (En): [start](the )(an )(but )(the )(the )(we )(an )(going )(the )(it)the modern collections contains a file that you call it reproduction, but you don’t want to add to any collection - any collection.[end]
Reference translation (En): The Temporary Collection is a collection that is used to hold songs you want to play but that you do n’t want to add to any collection.

Data: OpSub
Source sentence (Ru): .. если они захотят проверить кабину.
Model prediction (En): [start](in )(and)(“ )(so )(.)(.)(now)(.)(.)(“ )....if they want to check the cable test.[end]
Reference translation (En): ...if they want to check out the cabin.

Data: OpSub
Source sentence (Ru): Вы попросите кого-то из друзей сделать это для вас.
Model prediction (En): [start](ask )(people )(you )(just )(ask )(take )(if )(you )(the )(so )you have to ask someone from friends to do it for you.......[end]
Reference translation (En): You’re gonna get one of your judge friends to do it for you.

Data: TED
Source sentence (Ru): Нужно всё очень хорошо спланировать.
Model prediction (En): [start](it)(we )(we )(you )(we )(it )(we )(we )(you )(it )we have to do all this very well.[end]
Reference translation (En): But it does require a lot of planning.

Data: TED
Source sentence (Ru): Знаю, что вас этому не учили, это не так легко заметить но всё же попробуйте.
Model prediction (En): [start](now)(i )(i )(i )(and )(and )(now )(i )(i )(i )i know you don’t learn that, it’s not so easy to notice but still try.[end]
Reference translation (En): I know that if you’re not a trained brain expert, it’s not that obvious, but just take a look, ok?

Data: WSC
Source sentence (Ru): Мужчина не мог поднять своего сына, потому что он был слишком тяжел.
Model prediction (En): [start](the )(the )(the )(the )(the )(the )(the )(the )(it )(the )the sales of the person couldn’t rise up his son, because he was too bold.[end]
Reference translation (En): The man couldn’t lift his son because he was so heavy.

Data: WSC
Source sentence (Ru): Боб заплатил за обучение Чарльза в университете. Но теперь Чарльз забыл об этом. Он не чувствует себя обязанным.
Model prediction (En): [start](however)(and )(sta)(the )(they )(one )(the )(he )(and )(higher )the prospect of learning to get closer to university, and the attorney now forgotten about that. the point is that it wasn’t worth taking care of the same.[end]
Reference translation (En): Bob paid for Charlie’s college education, but now Charlie acts as though it never happened. He is very ungrateful.
Table 4: Examples of the sequences of different lengths predicted by Transformer with working
memory in the decoder and reference translations. The tokens stored in working memory are
written in parentheses. [start] and [end] denote starting and ending tokens of the sequence,
correspondingly. The remaining tokens represent the translation prediction.
It is natural to assume that translation of more difficult sentences should
require more extensive utilization of working memory. To analyze the intensity
of the working memory usage, we calculated the distributions of the number
of unique working memory tokens. The histograms for all datasets before and
after fine-tuning are presented in Fig. 2.
[Two histograms of the number of unique tokens in memory (x-axis: unique tokens in memory, 0-10; y-axis: frequency): (a) Before fine-tuning, (b) After fine-tuning. Legend: IT documents, Open Subtitles, TED, WSC.]
Figure 2: Distributions of the number of unique tokens stored in working memory for the
TED, WSC, IT documents, and Open Subtitles datasets. The legend for both histograms is
presented in panel (b). Before fine-tuning, WSC, Open Subtitles, and
IT documents memory diversity was larger than TED predictions’ memory diversity. After
fine-tuning, all datasets’ working memory was mostly filled with a single repeating token. So,
while processing the unseen data, the model exhibits higher variability of the working memory
content. More complex datasets demonstrate higher memory diversity. After fine-tuning, the
model was aligned with the data, and working memory had more repetitive tokens.
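The diversity statistic behind these histograms is straightforward to compute: for each translated sentence we count the distinct tokens among the memory slots and aggregate the counts, e.g.:

from collections import Counter

def memory_diversity_histogram(memory_per_sample):
    # memory_per_sample: list of lists of the tokens written to working memory
    # for each translated sentence (10 tokens per sample in our setting).
    counts = Counter(len(set(mem)) for mem in memory_per_sample)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}

# Toy example: one sample with a single repeating token, one with four unique tokens.
hist = memory_diversity_histogram([["the"] * 10,
                                   ["a", "a", "cat", "sat"] + ["on"] * 6])
# -> {1: 0.5, 4: 0.5}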
From the histograms, we see that before fine-tuning, the memory content is
more diverse for more complex sentences: the Open Subtitles and WSC working
memory most frequently contains three and four different tokens, respectively,
and the IT documents have seven different tokens in memory on average.
On the other hand, the TED predictions’ memory is most frequently filled with
a single token. After fine-tuning, the model tends to store a single repeating to-
ken in memory most frequently for all datasets, sharpening the model attention
on a specific term.
We collected the model predictions after each epoch to explore the memory
content behavior during fine-tuning. Figure 3 shows the average number of
unique tokens stored in working memory for each fine-tuning experiment.
For all datasets, we see a diversity decrease in working memory during fine-
tuning. The IT documents have higher overall memory diversity than the Wino-
grad schemas and the Open Subtitles data, and the TED transcriptions have
the lowest memory diversity. The IT documents contain the most field-specific
texts, and the highest values of average memory diversity indicate that IT texts
are very challenging to translate compared with the rest of the datasets used in
our experiments.
Keywords are the most relevant and the most important words in a text.
A collection of keywords helps to summarize the text and grasp the main topics
discussed. So, we expected to find a higher number of keywords in memory for
difficult IT and WSC datasets compared with Open Subtitles and TED.
[Line plot: x-axis — epoch (21 to 50), y-axis — average unique tokens in memory. Legend: IT documents, Open Subtitles, TED, WSC.]
Figure 3: Average working memory diversity measured after each epoch of fine-tuning. The
dashed lines are linear least squares fits. The plot confirms that during fine-tuning, the working
memory content becomes more uniform. The minimal number of unique memory tokens is
larger for more complex texts (IT docs and WSC) than for simpler texts (Open Subtitles and
TED).
We extracted keywords from predicted translations and reference translations
and calculated how many keywords are stored in working memory. The
keyword extraction was made with the Rapid Automatic Keyword Extraction
method (RAKE) (Rose et al., 2010). The resulting probabilities to find one or
more keywords from the predicted sequence in memory are presented in Fig. 4a.
All datasets had at least one of the predicted sentence keywords in memory be-
fore and after fine-tuning. To assess the differences between keywords data, we
used the Wilcoxon rank-sum test. We provide p-values for statistically signifi-
cant differences. After fine-tuning, the IT documents had a significantly higher
probability of storing keywords in memory than before fine-tuning (p < 0.001).
Overall, the memory of more complex texts stored predicted sequence keywords
more often than the memory of simpler texts (the differences were statistically
significant between the IT documents and TED datasets (p < 0.01) before and
after fine-tuning and between IT documents and Open Subtitles after fine-tuning
(p < 0.0001)). Searching for reference sequence keywords in working memory, we
found that, similarly to the predictions’ keywords, the probability of finding keywords
in memory for the IT documents significantly increased after fine-tuning (p < 0.05).
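A sketch of this analysis with the rake-nltk and SciPy packages is shown below; it assumes the memory tokens have already been detokenized into lowercase words, and the exact preprocessing in our pipeline may differ.

from rake_nltk import Rake          # requires the nltk stopwords corpus
from scipy.stats import ranksums

def keyword_in_memory(prediction_text, memory_tokens):
    # Returns 1 if at least one word of a RAKE keyword phrase extracted from
    # the predicted translation appears among the working memory tokens.
    rake = Rake()
    rake.extract_keywords_from_text(prediction_text)
    keyword_words = {w for phrase in rake.get_ranked_phrases() for w in phrase.split()}
    memory_words = {tok.strip().lower() for tok in memory_tokens}
    return int(bool(keyword_words & memory_words))

# Per-dataset indicator vectors can then be compared with the Wilcoxon rank-sum test:
# statistic, p_value = ranksums(indicators_it_docs, indicators_ted)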
Working memory could store words that possess semantic content. In lin-
guistics, such terms are called content words. Our hypothesis is that content
words in memory represent key points of the text to be translated. We applied
the keyword extraction method to memory to examine content words. The bar
plot in Fig. 4b shows the probabilities of finding at least one content word in working
memory. Similarly to the keywords probability analysis, the Wilcoxon rank-sum
test was applied to compare content words data. All datasets before and after
fine-tuning contain content words in working memory. In the IT documents’
and WSC memory samples, content words appear significantly more frequently
than in TED and Open Subtitles before and after fine-tuning (p < 0.01 for
pairs IT documents-Open Subtitles, IT documents-TED, WSC-Open Subtitles,
WSC-TED. Each comparison was held before the fine-tuning procedure and
13
IT documents Open Subtitles TED WSC
0.00
0.02
0.04
0.06
0.08
0.10
0.12
0.14
0.16
probability
before fine-tuning
after fine-tuning
(a) Predictions’ keywords in memory
IT documents Open Subtitles TED WSC
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
probability
before fine-tuning
after fine-tuning
(b) Content words
Figure 4: Probabilities (with confidence intervals) to find one or more (a) keywords extracted
from the model predictions and (b) content words in working memory for all datasets. The
keywords probability difference is significant for the IT documents-TED pair before and after
fine-tuning (p < 0.01) and for IT documents-Open Subtitles pair after fine-tuning (p < 0.0001).
Content words probabilities differ significantly for all complex-simple dataset pairs before and
after fine-tuning (p < 0.01).
after it).
Longer sentences usually have more information, so they should be harder
to translate. We checked how the average number of unique tokens in memory
changes with the translation length (Fig. 5). We can see that the diversity of
memory elements does not depend on the predicted translation length either
before or after fine-tuning.
[Two plots: x-axis — predicted sentence length in tokens (15 to 40), y-axis — average unique tokens in memory (1 to 8); legend: IT documents, Open Subtitles, TED, WSC. Panels: (a) Before fine-tuning, (b) After fine-tuning.]
Figure 5: Dependence of the average number of unique tokens in memory on the model
predicted sequence length (with the dashed lines showing linear least squares fits). The
average memory diversity does not significantly depend on the model prediction length either
before or after fine-tuning.
Tokens written to working memory are words or subwords of natural
language. So, as with the unique tokens in memory, we can examine which parts of
speech are most likely to be written to working memory. We collected the distributions
of parts of speech among unique tokens from memory after fine-tuning.
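These statistics can be gathered with a part-of-speech tagger; a sketch using spaCy is given below (the tagger choice and model name are our assumptions, not specified in the paper).

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")  # small English pipeline, assumed to be installed

def memory_pos_counts(memory_per_sample):
    # For each sample, tag the unique working memory tokens and count how many
    # times each universal POS tag (NOUN, VERB, DET, ...) occurs among them.
    per_sample_counts = []
    for mem in memory_per_sample:
        doc = nlp(" ".join(sorted(set(mem))))
        per_sample_counts.append(Counter(token.pos_ for token in doc))
    return per_sample_counts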
[Four heatmaps of the number of occurrences (0 to 7) of each part of speech (ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SPACE, SYM, VERB, X) among unique working memory tokens: (a) TED, (b) WSC, (c) Open Subtitles, (d) IT documents.]
Figure 6: Working memory parts-of-speech distributions for all datasets after fine-tuning. The
distributions are shown up to seven occurrences because none of the examined parts of speech
appeared more than seven times. After fine-tuning, the simpler texts’ memory (TED and Open
Subtitles) contains coordinating conjunctions, pronouns, and punctuation marks. The WSC
and IT documents’ working memory most frequently comprises determiners, nouns, proper
nouns, and verbs, similarly to the top parts of speech used in memory before fine-tuning.
Figure 6 shows that working memory for the WSC and IT documents datasets
stores determiners, nouns, proper nouns, and verbs significantly more often than
memory for TED and Open Subtitles (the Wilcoxon rank-sum test, p < 0.05 for
all mentioned parts of speech for all complex-simple data pairs). TED and Open
Subtitles also store significantly more coordinating conjunctions and pronouns
compared to IT documents and WSC (p < 0.05). Punctuation marks occur
significantly more often for TED and Open Subtitles compared to IT documents
(p < 0.0001) and for Open Subtitles compared to WSC (p < 0.0005).
6. Conclusion
This work explored the features of the elements stored in the symbolic working
memory of the neural Transformer architecture. We compared the working
memory content for a Russian-to-English machine translation task. We used the
IT documents, Open Subtitles, TED Talks transcripts, and Winograd Schema
Challenge datasets as examples of texts from different fields and of different levels
of translation complexity.
Firstly, we investigated if the information in memory is useful for solving
a machine translation problem. We calculated how many unique tokens were
stored in working memory most frequently and found that memory diversity is
lower for simpler texts than for more complex ones. When a data sample
appears in training for the first time, the maximum amount of information
about the text is written into memory. The longer the model is trained,
the better it adjusts to the data and the less diverse the memory content becomes.
Secondly, during the working memory content analysis, we checked if the
working memory content is relevant to the translated sentences. We calculated
how often keywords extracted from translations occur in memory and found that,
for all datasets, at least one keyword occurs in memory. We also calculated the
number of content words in working memory. Content words occur more often
when translating more challenging texts containing ambiguous (WSC) or
field-specific (IT documents) terms. Finally, we found that the memory diversity
decreases over the course of fine-tuning.
We examined parts of speech stored in memory: for more complex texts,
determiners, nouns, proper nouns, and verbs occur more frequently than for
less complex ones. This shows that memory is used to record information about
the grammatical structure of more complex texts.
Compliance with ethical standards
Ethical approval: This article does not contain any studies involving human
participants or animals performed by any of the authors.
Funding
This work was supported by a grant for research centers in the field of
artificial intelligence, provided by the Analytical Center for the Government
of the Russian Federation under the subsidy agreement (agreement identifier
000000D730321P5Q0002) and the agreement with the Moscow Institute of
Physics and Technology dated November 1, 2021 No. 70-2021-00138.
Declaration of competing interest
The authors declare that they have no known competing financial interests or
personal relationships that could have appeared to influence the work reported
in this paper.
References
Ainslie, J., Ontanon, S., Alberti, C., Cvicek, V., Fisher, Z., Pham, P., Ravula,
A., Sanghai, S., Wang, Q., & Yang, L. (2020). ETC: Encoding long and
structured inputs in transformers. In Proceedings of the 2020 Conference
on Empirical Methods in Natural Language Processing (EMNLP) (pp. 268–
284). Online: Association for Computational Linguistics. URL: https:
//www.aclweb.org/anthology/2020.emnlp-main.19. doi:10.18653/v1/2020.
emnlp-main.19.
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document
transformer. arXiv:2004.05150.
Burtsev, M. S., Kuratov, Y., Peganov, A., & Sapunov, G. V. (2021). Memory
transformer. arXiv:2006.11527.
Chandar, S., Ahn, S., Larochelle, H., Vincent, P., Tesauro, G., & Bengio, Y.
(2016). Hierarchical memory networks. arXiv:1605.07427.
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines.
arXiv:1410.5401.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-
Barwińska, A., Colmenarejo, S. G., Grefenstette, E., Ramalho, T., Agapiou,
J., Badia, A. P., Hermann, K. M., Zwols, Y., Ostrovski, G., Cain, A., King,
H., Summerfield, C., Blunsom, P., Kavukcuoglu, K., & Hassabis, D. (2016).
Hybrid computing using a neural network with dynamic external memory.
Nature,538 , 471–476. URL: http://dx.doi.org/10.1038/nature20101.
Gulcehre, C., Chandar, S., & Bengio, Y. (2017a). Memory augmented neural
networks with wormhole connections. arXiv:1701.08718.
Gulcehre, C., Chandar, S., Cho, K., & Bengio, Y. (2017b). Dynamic neural
turing machine with soft and hard addressing schemes. arXiv:1607.00036.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural
Computation,9, 1735–1780.
Holtzman, A., Buys, J., Forbes, M., & Choi, Y. (2019). The curious case of
neural text degeneration. CoRR,abs/1904.09751 . URL: http://arxiv.org/
abs/1904.09751. arXiv:1904.09751.
Lavie, A., & Agarwal, A. (2007). METEOR: An automatic metric for MT eval-
uation with high levels of correlation with human judgments. In Proceedings
of the Second Workshop on Statistical Machine Translation (pp. 228–231).
Prague, Czech Republic: Association for Computational Linguistics. URL:
https://aclanthology.org/W07-0734.
Levesque, H. J., Davis, E., & Morgenstern, L. (2012). The winograd schema
challenge. In Proceedings of the Thirteenth International Conference on Prin-
ciples of Knowledge Representation and Reasoning KR’12 (p. 552–561). AAAI
Press.
Meng, Y., & Rumshisky, A. (2018). Context-aware neural model for temporal
information extraction. In Proceedings of the 56th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1: Long Papers) (pp. 527–
536). Melbourne, Australia: Association for Computational Linguistics. URL:
https://www.aclweb.org/anthology/P18-1049. doi:10.18653/v1/P18-1049.
Miyake, A., & Shah, P. (Eds.) (1999). Models of Working Memory: Mecha-
nisms of Active Maintenance and Executive Control. New York: Cambridge
University Press.
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for
automatic evaluation of machine translation. In Proceedings of the 40th An-
nual Meeting of the Association for Computational Linguistics (pp. 311–318).
Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.
URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
Rae, J. W., Hunt, J. J., Harley, T., Danihelka, I., Senior, A., Wayne, G., Graves,
A., & Lillicrap, T. P. (2016). Scaling memory-augmented neural networks
with sparse reads and writes. arXiv:1610.09027.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou,
Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with
a unified text-to-text transformer. J. Mach. Learn. Res.,21 , 140:1–140:67.
URL: http://jmlr.org/papers/v21/20-074.html.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic key-
word extraction from individual documents. In M. W. Berry, & J. Ko-
gan (Eds.), Text Mining. Applications and Theory (pp. 1–20). John Wi-
ley and Sons, Ltd. URL: http://dx.doi.org/10.1002/9780470689646.ch1.
doi:10.1002/9780470689646.ch1.
Sagirova, A., & Burtsev, M. (2022). Extending transformer decoder with work-
ing memory for sequence to sequence tasks. doi:https://doi.org/10.1007/
978-3-030-91581-0_34.
Sukhbaatar, S., Szlam, A., Weston, J., & Fergus, R. (2015). End-to-end memory
networks. arXiv:1503.08895.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In LREC.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Proceedings of
the 31st International Conference on Neural Information Processing Systems
NIPS’17 (p. 6000–6010). Red Hook, NY, USA: Curran Associates Inc.
Weston, J., Chopra, S., & Bordes, A. (2015). Memory networks.
arXiv:1410.3916.
Ye, Q., Devendra, S., Matthieu, F., Sarguna, P., & Graham, N. (2018). When
and why are pre-trained word embeddings useful for neural machine transla-
tion. In HLT-NAACL.
Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontañón, S.,
Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big bird:
Transformers for longer sequences. CoRR,abs/2007.14062 . URL: https://
arxiv.org/abs/2007.14062.