Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing
Ali Basirat and Joakim Nivre
Department of Linguistics and Philology
Uppsala University
{ali.basirat,joakim.nivre}@lingfil.uu.se
Abstract

We show that a set of real-valued word vectors formed by the right singular vectors of a transformed co-occurrence matrix is meaningful for determining different types of dependency relations between words. Our experimental results on the task of dependency parsing confirm the superiority of these word vectors over other sets of word vectors generated by popular methods of word embedding. We also study the effect of using these vectors on the accuracy of dependency parsing in different languages versus using more complex parsing architectures.
1 Introduction

Greedy transition-based dependency parsing is appealing thanks to its efficiency, deriving a parse tree for a sentence in linear time using a discriminative classifier. Among the different classification methods used in greedy dependency parsers, neural network models capable of using real-valued vector representations of words, called word vectors, have shown significant improvements in both accuracy and speed of parsing. It was first proposed by Chen and Manning (2014) to use word vectors in a 3-layered feed-forward neural network as the core classifier in a transition-based dependency parser. The classifier is trained by the standard back-propagation algorithm. Using a limited number of features defined over a certain number of elements in a parser configuration, they could build an efficient and accurate parser, called the Stanford neural dependency parser. This architecture was then extended by Straka et al. (2015) and Straka et al. (2016). Parsito (Straka et al., 2015) adds a search-based oracle and a set of morphological features to the original architecture in order to make it capable of parsing the corpus of universal dependencies. UDPipe (Straka et al., 2016) adds beam search decoding to Parsito in order to improve the parsing accuracy at the cost of decreasing the parsing speed.

We propose to improve the parsing accuracy of the architecture introduced by Chen and Manning (2014) through the use of more informative word vectors. The idea is based on the greedy nature of the back-propagation algorithm, which makes it highly sensitive to its initial state. Thus, it is expected that higher-quality word vectors positively affect the parsing accuracy. The word vectors in our approach are formed by the right singular vectors of a matrix returned by a transformation function that takes a probability co-occurrence matrix as input and expands the data massed around zero. We show how the proposed method is related to HPCA (Lebret and Collobert, 2014) and GloVe (Pennington et al., 2014).

Using these word vectors with the Stanford parser, we obtain a parsing accuracy of 93.0% UAS and 91.7% LAS on the Wall Street Journal (Marcus et al., 1993). The word vectors consistently improve parsing models trained with different types of dependencies in different languages. Our experimental results show that parsing models trained with the Stanford parser can be as accurate as, or in some cases more accurate than, other parsers such as Parsito and UDPipe.
2 Transition-Based Dependency Parsing

A greedy transition-based dependency parser derives a parse tree from a sentence by predicting a sequence of transitions between a set of configurations characterized by a triple c = (Σ, B, A), where Σ is a stack that stores partially processed nodes, B is a buffer that stores unprocessed nodes in the input sentence, and A is a set of partial parse trees assigned to the processed nodes. Nodes are positive integers corresponding to the linear positions of words in the input sentence. The process of parsing starts from an initial configuration and ends with some terminal configuration. The transitions between configurations are controlled by a classifier trained on a history-based feature model which combines features of the partially built dependency tree and attributes of input tokens.

The arc-standard algorithm (Nivre, 2004) is among the many different algorithms proposed for moving between configurations. The algorithm starts with the initial configuration in which all words are in B, Σ is empty, and A holds an artificial node 0. It uses three actions, Shift, Right-Arc, and Left-Arc, to transition between the configurations and build the parse tree. Shift pushes the head node in the buffer onto the stack unconditionally. The two actions Left-Arc and Right-Arc are used to build left and right dependencies, respectively, and are restricted by the fact that the final dependency tree has to be rooted at node 0.
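The transition system just described can be sketched as follows. This is a minimal illustration in Python; the list-based configuration, the function names, and the choice to place the artificial root node 0 on the stack are our assumptions, not the parser's actual code:

    def initial_config(n_words):
        # stack with the artificial root node 0, buffer with all word positions, empty arc set
        return ([0], list(range(1, n_words + 1)), set())

    def shift(stack, buffer, arcs):
        return (stack + [buffer[0]], buffer[1:], arcs)

    def left_arc(stack, buffer, arcs, label):
        # the second-topmost stack node becomes a dependent of the topmost one;
        # node 0 may never become a dependent, which keeps the tree rooted at 0
        head, dep = stack[-1], stack[-2]
        assert dep != 0
        return (stack[:-2] + [head], buffer, arcs | {(head, label, dep)})

    def right_arc(stack, buffer, arcs, label):
        # the topmost stack node becomes a dependent of the second-topmost one
        head, dep = stack[-2], stack[-1]
        return (stack[:-1], buffer, arcs | {(head, label, dep)})

    def is_terminal(stack, buffer, arcs):
        return not buffer and stack == [0]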
3 Stanford Dependency Parser

The Stanford dependency parser can be considered a turning point in the history of greedy transition-based dependency parsing. The parser significantly improved both the accuracy and the speed of dependency parsing. The key to its success can be summarized in two points: 1) accuracy is improved by using a neural network with pre-trained word vectors, and 2) efficiency is improved by pre-computation to keep the most frequent computations in memory.

The parser is an arc-standard system with a feed-forward neural network as its classifier. The neural network consists of three layers: an input layer connects the network to a configuration through 3 real-valued vectors representing words, POS tags, and dependency relations. The vectors that represent POS tags and dependency relations are initialized randomly, but those that represent words are initialized by word vectors systematically extracted from a corpus. Each of these vectors is independently connected to the hidden layer of the network through three distinct weight matrices. A cube activation function is used in the hidden layer to model the interactions between the elements of the vectors. The activation function resembles a third-degree polynomial kernel that enables the network to take different combinations of vector elements into consideration. The output layer generates probabilities for decisions between the different actions in the arc-standard system. The network is trained by the standard back-propagation algorithm, which updates both the network weights and the vectors used in the input layer.
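A forward pass through this architecture can be sketched as follows. It is a minimal illustration assuming concatenated word, POS-tag, and dependency-label embeddings as inputs; the layer names and the softmax output are our assumptions, not the Stanford parser's actual code:

    import numpy as np

    def forward(x_word, x_pos, x_dep, W_word, W_pos, W_dep, b, W_out):
        # each group of input embeddings has its own weight matrix into the hidden layer
        h = W_word @ x_word + W_pos @ x_pos + W_dep @ x_dep + b
        h = h ** 3                                  # cube activation
        scores = W_out @ h                          # one score per transition
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()                      # probabilities over transitions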
4 Word Vectors

Dense vector representations of words, in this paper referred to as word vectors, have led to great improvements in natural language processing tasks. An advantage of this representation compared to the traditional one-hot representation is that the word vectors are enriched with information about the distribution of words in different contexts.

Following Lebret and Collobert (2014), we propose to extract word vectors from a co-occurrence matrix as follows. First we build a co-occurrence matrix C from a text. The element C_{i,j} is a maximum likelihood estimate of the probability of seeing word w_j in the context of word w_i, i.e., C_{i,j} = p(w_j | w_i). This results in a sparse matrix whose data are massed around zero because of the disproportional contribution of high-frequency words in estimating the co-occurrence probabilities. Each column of C can be seen as a vector in a high-dimensional space whose dimensions correspond to the context words. In practice, we need to reduce the dimensionality of these vectors. This can be done by standard methods of dimensionality reduction such as principal component analysis. However, the high density of data around zero and the presence of a small number of data points far from the data mass can lead to some meaningless discrepancies between word vectors.

In order to have a better representation of the data, we expand the probability values in C by skewing the data mass from zero toward one. This can be done by any monotonically increasing concave function that magnifies small numbers in its domain while preserving the given order. Commonly used transformation functions with these characteristics are the logarithm function, the hyperbolic tangent, the power transformation, and the Box-Cox transformation with some specific parameters. Fig. 1 shows how a transformation function expands high-dimensional word vectors massed around zero.
After applying f to C, we centre the column vectors of f(C) around their mean and build the word vectors as below:

    Υ = γ V^T_{n,k}                                        (1)

where Υ is a matrix of k-dimensional word vectors associated with n words, V^T_{n,k} consists of the top k right singular vectors of f(C), and γ = λ√n is a constant factor that scales the unbounded data in the word vectors. In the following, we will refer to our model as RSV, standing for Right Singular word Vectors or Real-valued Syntactic word Vectors, as mentioned in the title of this paper.

Figure 1: PCA visualization of the high-dimensional column vectors of a co-occurrence matrix (a) before and (b) after applying the transformation function x^(1/10). [The scatter plots of word labels are omitted here.]
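The construction described in this section can be sketched end to end as follows. This is a minimal illustration assuming a tokenized corpus, the immediately preceding word as context, a 7th-root transformation, and the scaling γ = λ√n as reconstructed in Eq. 1; all names are ours and the orientation of C (context words as rows, target words as columns) is an assumption chosen so that the right singular vectors yield one vector per word:

    import numpy as np

    def rsv_vectors(tokens, vocab, context_words, k=50, lam=0.1, n_root=7):
        w2i = {w: i for i, w in enumerate(vocab)}
        c2i = {w: i for i, w in enumerate(context_words)}
        # co-occurrence counts over (preceding context word, target word) pairs
        counts = np.zeros((len(context_words), len(vocab)))
        for prev, cur in zip(tokens, tokens[1:]):
            if cur in w2i and prev in c2i:
                counts[c2i[prev], w2i[cur]] += 1
        # each column becomes a probability distribution over context words
        C = counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)
        F = C ** (1.0 / n_root)                      # transformation that skews data away from zero
        F = F - F.mean(axis=1, keepdims=True)        # centre the column vectors around their mean
        _, _, Vt = np.linalg.svd(F, full_matrices=False)
        n = len(vocab)
        return lam * np.sqrt(n) * Vt[:k, :]          # (k, n): each column is one word vector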
5 Experimental Setting

We restrict our experiments to three languages: English, Swedish, and Persian. Our experiments on English are organized as follows. Using different types of transformation functions, we first extract a set of word vectors that gives the best parsing accuracy on our development set. Then we study the effect of dimensionality on parsing performance. Finally, we give a comparison between our best results and the results obtained from other sets of word vectors generated with popular methods of word embedding. Using the best transformation function obtained for English, we extract word vectors for Swedish and Persian. These word vectors are then used to train parsing models on the corpus of universal dependencies.

The English word vectors are extracted from a corpus consisting of raw sentences from the Wall Street Journal (WSJ) (Marcus et al., 1993), the English Wikicorpus [1], the Thomson Reuters Text Research Collection (TRC2), the English Wikipedia corpus [2], and the Linguistic Data Consortium (LDC) corpus. We concatenate all the corpora and split the sentences with the OpenNLP sentence splitting tool. The Stanford tokenizer is used for tokenization. Word vectors for Persian are extracted from the Hamshahri Corpus (AleAhmad et al., 2009), the Tehran Monolingual Corpus [3], and the Farsi Wikipedia downloaded from Wikipedia Monolingual Corpora [4]. The Persian text normalizer tool (Seraji, 2015) is used for sentence splitting and tokenization [5]. Word vectors for Swedish are extracted from the Swedish Wikipedia available at Wikipedia Monolingual Corpora, the Swedish web news corpora (2001-2013), and the Swedish Wikipedia corpus collected by Språkbanken [6]. The OpenNLP sentence splitter and tokenizer are used for normalizing the corpora.

We replace all numbers with a special token NUMBER and convert uppercase letters to lowercase forms in English and Swedish. Word vectors are extracted only for the unique words appearing at least 100 times. We choose the cut-off frequency of 100 because it is commonly used as a standard threshold in related work. The 10,000 most frequent words are used as context words in the co-occurrence matrix. Table 1 presents some statistics of the corpora.
            #Tokens   #W1          #W100     #Sents
English     8×10^9    14,462,600   404,427   4×10^8
Persian     4×10^8    1,926,233    60,718    1×10^7
Swedish     6×10^8    5,437,176    174,538   5×10^7

Table 1: Size of the corpora from which word vectors are extracted; #Tokens: total number of tokens; #W_k: number of unique words appearing at least k times in the corpora; #Sents: number of sentences.
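The preprocessing just described can be sketched as follows. The exact treatment of digits inside tokens and the helper names are our assumptions:

    import re
    from collections import Counter

    def normalize(token):
        # lowercase and replace digit runs with the special token NUMBER
        return re.sub(r"[0-9]+", "NUMBER", token.lower())

    def build_vocabularies(tokens, min_count=100, n_context=10_000):
        counts = Counter(normalize(t) for t in tokens)
        vocab = [w for w, c in counts.items() if c >= min_count]        # words to embed
        context_words = [w for w, _ in counts.most_common(n_context)]   # context vocabulary
        return vocab, context_words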
The word vectors are evaluated with respect to the accuracy of parsing models trained with them using the Stanford neural dependency parser (Chen and Manning, 2014). The English parsing models are trained and evaluated on the corpus of universal dependencies (Nivre et al., 2016) version 1.2 (UD) and the Wall Street Journal (WSJ) (Marcus et al., 1993) annotated with Stanford typed dependencies (SD) (De Marneffe and Manning, 2010) and CoNLL syntactic dependencies (CD) (Johansson and Nugues, 2007). We split WSJ as follows: sections 02-21 for training, section 22 for development, and section 23 as the test set. The Stanford conversion tool (De Marneffe et al., 2006) and the LTH conversion tool [7] are used for converting the constituency trees in WSJ to SD and CD. The Swedish and Persian parsing models are trained on the corpus of universal dependencies. All the parsing models are trained with gold POS tags unless we explicitly mention that predicted POS tags are used.

[1] http://www.cs.upc.edu/~nlp/wikicorpus/
[2] https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml
[3] http://ece.ut.ac.ir/system/files/NLP/Resources/
[4] http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora
[5] http://stp.lingfil.uu.se/~mojgan
[6] https://spraakbanken.gu.se/eng/resources/corpus
[7] http://nlp.cs.lth.se/software/treebank-converter/
6 Results

In the following we study how word vectors generated by RSV influence parsing performance. RSV has four main tuning parameters: 1) the context window, 2) the transformation function f, 3) the parameter λ used in the normalization step, and 4) the number of dimensions. The context window can be symmetric or asymmetric with different lengths. We choose the asymmetric context window of length 1, i.e., the first preceding word, as suggested by Lebret and Collobert (2015) for syntactic tasks. λ is a task-dependent parameter that controls the variance of the word vectors. In order to find the best value of λ, we train the parser with different sets of word vectors generated randomly by Gaussian distributions with zero mean and isotropic covariance matrices λI with values of λ ∈ (0, 1]. Fig. 2a shows the parsing accuracies obtained from these word vectors. The best results are obtained from word vectors generated with λ = 0.1 and λ = 0.01. The variation in the results shows the importance of the variance of the word vectors for the accuracy of the parser. Based on this observation, we set the normalization parameter λ in Eq. 1 to 0.1. The two remaining parameters are explored in the following subsections.
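The random baseline vectors used in this tuning step can be drawn as in the following sketch (the dimensionality of 50 follows the experiments reported in Fig. 2; the function name is ours):

    import numpy as np

    def random_word_vectors(n_words, k=50, lam=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # zero mean and isotropic covariance lam * I, i.e., standard deviation sqrt(lam) per dimension
        return rng.normal(loc=0.0, scale=np.sqrt(lam), size=(k, n_words))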
6.1 Transformation Function

We have tested four sets of transformation functions on the co-occurrence matrix:

    f1(x) = tanh(nx)
    f2(x) = x^(1/n)
    f3(x) = n(x^(1/n) - 1)
    f4(x) = log(2^(n+1) x + 1) / log(2^(n+1) + 1)

where x ∈ [0, 1] is an element of the matrix and n is a natural number that controls the degree of skewness: the higher the value of n, the more the data will be skewed. Fig. 2b shows the effect of using these transformation functions on parsing accuracy. The best results are obtained from the nth-root and Box-Cox transformation functions with n = 7, which are closely related to each other. Denoting the set of word vectors obtained from the transformation function f as Υ(f), it can be shown that Υ(f3) = Υ(f2), since the effect of the coefficient n in the first term of f3 is cancelled out by the right singular vectors in Eq. 1.
Fig. 2c visualizes the best transformation function in each of the function sets. All the functions share the property of having relatively high derivatives around 0, which allows skewing the data in the co-occurrence matrix to the right (i.e., close to one) and creating a clear gap between the syntactic structures that can happen (i.e., non-zero probabilities in the co-occurrence matrix) and those that cannot happen (i.e., zero probabilities in the co-occurrence matrix). This movement from zero to one, however, can lead to disproportional expansion between the data close to zero and the data close to one. This is because the limit of the ratio of the derivative of f_i(x), i = 1, 2, 3, 4, to the derivative of f_i(y) as x approaches 0 and y approaches 1 is infinity, i.e., lim_{x→0, y→1} f_i'(x)/f_i'(y) = ∞. The almost uniform behaviour of f1(x) for x > 0.4 results in a small variance in the generated data that will be ignored by the subsequent singular value decomposition step and consequently loses the information provided by the most probable context words. Our solution to these problems is to use the following piecewise transformation function f:

    f(x) = tanh(7x)   if x ≤ θ
    f(x) = x^(1/7)    if x > θ                             (2)

where θ = 10^(-n) and n ∈ N. This function expands the data in a more controlled manner (i.e., lim_{x→0, y→1} f'(x)/f'(y) = 49) with less loss of the information provided by the variance of the data. Using this function with θ = 10^(-7), we obtain a UAS of 92.3 and an LAS of 90.9 on the WSJ development set annotated with Stanford typed dependencies, which is slightly better than the other transformation functions (see Fig. 2b). We obtain a UAS of 92.0 and an LAS of 90.6 on the WSJ test set with the same setting.
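The piecewise function in Eq. 2 can be written down directly, as in the following sketch (assuming θ = 10^(-7) and the 7th root as above; the function name is ours):

    import numpy as np

    def f_piecewise(x, theta=1e-7, n=7):
        x = np.asarray(x, dtype=float)
        # tanh(7x) has slope about 7 near zero, while the 7th root has slope 1/7 at one,
        # so the expansion ratio between the two ends is bounded by 49
        return np.where(x <= theta, np.tanh(n * x), x ** (1.0 / n))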
6.2 Dimensionality
The dimensionality of word vectors is determined
by the number of singular vectors used in Eq. 1.
Figure 2: Unlabelled attachment score of parsing models trained with (a) the randomly generated word vectors, and (b) the systematically extracted word vectors using different transformation functions. The experiments are carried out with 50-dimensional word vectors. Parsing models are evaluated on the development set of the Wall Street Journal annotated with Stanford typed dependencies (SD). f in (b) is the piecewise transformation function shown in Eq. 2. (c): the transformation functions in each function set resulting in the best parsing models (f1 = tanh(7x), f2 = x^(1/7), (1/7)(f3 + 7) = x^(1/7), f4 = log(2^9 x + 1)/log(2^9 + 1)). The horizontal axis shows the data in a probability co-occurrence matrix and the vertical axis shows their projection after transformation. For better visualization, the range of f3 is scaled to [0, 1]. [Plots omitted here.]
High-dimensional word vectors are expected to result in higher parsing accuracies, because they can capture more information from the original data: the Frobenius norm of the difference between the original matrix and its truncated estimation depends on the number of top singular vectors used for constructing the truncated matrix. This gain, however, comes at the cost of more computational resources (a) to extract the word vectors, and (b) to process the word vectors in the parser. The most expensive step in extracting the word vectors is the singular value decomposition of the transformed co-occurrence matrix. Using the randomized SVD method described by Tropp et al. (2009), the extraction of the k top singular vectors of an m×n matrix requires O(mn log(k)) floating point operations. This shows that the cost of having larger word vectors grows logarithmically with the number of dimensions.

The parsing performance is affected by the dimensionality of the word vectors fed into the input layer of the neural network in two ways. First, a higher number of dimensions in the input layer leads to a larger weight matrix between the input layer and the hidden layer. Second, a larger hidden layer is needed to capture the dependencies between the elements in the input layer. Given a set of word vectors with k dimensions connected to a hidden layer with h hidden units, the weight matrix between the input layer and the hidden layer grows as O(kh), and the weight matrix between the hidden layer and the output layer grows as O(h). For each input vector, the back-propagation algorithm passes over the weight matrices three times per iteration: 1) to forward the input vector through the network, 2) to back-propagate the errors generated by the inputs, and 3) to update the network parameters. So, each input vector needs O(3(kh + h)) time to be processed by the algorithm. Given the trained model, the output signals are generated through only one forward pass.
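A truncated SVD of the transformed co-occurrence matrix can be computed with a randomized solver, as in the following sketch. It uses scikit-learn's TruncatedSVD as a stand-in for the randomized method of Tropp et al. (2009) and is not the implementation used in our experiments:

    from sklearn.decomposition import TruncatedSVD

    def top_k_right_singular_vectors(F, k):
        # F: transformed, column-centred co-occurrence matrix (contexts x words)
        svd = TruncatedSVD(n_components=k, algorithm="randomized", random_state=0)
        svd.fit(F)
        return svd.components_          # shape (k, n_words): rows are right singular vectors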
Table 2 shows how high-dimensional word vectors affect the parsing performance. In general, increasing the number of hidden units leads to a more accurate parsing model at a linear cost in parsing speed. Increasing the dimensionality of the word vectors up to 200 dimensions consistently increases the parsing accuracy, again at a linear cost in parsing speed. However, increasing both the dimensionality of the word vectors and the size of the hidden layer leads to a quadratic decrease in parsing speed. The best results are obtained from the parsing model trained with 100-dimensional word vectors and 400 hidden units, resulting in a parsing accuracy of 93.0 UAS and 91.7 LAS on our test set, a +1.0 UAS and +1.1 LAS improvement over what we obtained with 50-dimensional word vectors. It is obtained at the cost of a 47% reduction in parsing speed.
6.3 Comparison and Consistency

We evaluate the RSV word vectors on different types of dependency representations and different languages. Table 3 gives a comparison between RSV and different methods of word embedding with respect to their contributions to dependency parsing and the time required to generate word vectors for English. All the parsing models are trained with 400 hidden units and 100-dimensional word vectors extracted from the English raw corpus described in Sec. 5.
              h = 200               h = 300               h = 400
  k      UAS    LAS    P        UAS    LAS    P        UAS    LAS    P
  50     92.3   90.9   392      92.9   91.5   307      93.0   91.6   237
 100     92.6   91.2   365      92.9   91.5   263      93.1   91.8   206
 150     92.6   91.2   321      92.9   91.5   236      93.1   91.8   186
 200     92.7   91.3   310      93.1   91.7   212      93.1   91.8   165
 250     92.7   91.2   286      92.9   91.5   201      93.0   91.7   146
 300     92.6   91.2   265      92.9   91.6   180      92.9   91.5   119
 350     92.7   91.2   238      92.8   91.4   174      92.9   91.5   111
 400     92.6   91.2   235      92.8   91.3   141      93.0   91.5    97

Table 2: The performance of parsing models trained with k-dimensional word vectors and h hidden units. The parsing accuracies (UAS, LAS) are for the development set. P: parsing speed (sentences/second).
The word vectors are extracted on a Linux machine running on 12 CPU cores. The free parameters of the word embedding methods, i.e., context type and context size, have been tuned on the development set, and the best settings, resulting in the highest parsing accuracies, were then chosen for comparison. This leads us to an asymmetric window of length 1 for RSV, GloVe, and HPCA, and a symmetric window of length 1 for the word2vec models, CBOW and SkipGram. The GloVe word vectors are extracted with the available implementation of GloVe (Pennington et al., 2014) running for 50 iterations. The HPCA word vectors are extracted with our implementation of the method (Lebret and Collobert, 2014). CBOW and SkipGram word vectors are extracted with the available implementation of word2vec (Mikolov et al., 2013) running for 10 iterations and a negative sampling value of 5. In order to show the validity of the results, we perform a bootstrap statistical significance test (Berg-Kirkpatrick et al., 2012) on the results obtained from each parsing experiment and RSV, with the null hypothesis H0: RSV is no better than model B, where B can be any of the word embedding methods. The resulting p-values are reported together with the parsing accuracies.

The empirical results show that HPCA, RSV, and GloVe are the fastest methods of word embedding, in that order. The reason why these methods are faster than word2vec is that they scan the corpus only once and then store it as a co-occurrence matrix in memory. The reason why HPCA is faster than RSV is that HPCA stores the co-occurrence matrix as a sparse matrix while RSV stores it as a full matrix. This extra expense, however, makes RSV better suited than HPCA for the task of dependency parsing.
Model    Time     SD             CD             UD
                  UAS    LAS     UAS    LAS     UAS    LAS
CBOW     8741     93.0   91.5    93.4   92.6    88.0   85.4
  p-val           0.00   0.00    0.02   0.00    0.00   0.00
SGram    11113    93.0   91.6    93.4   92.5    87.4   84.9
  p-val           0.06   0.02    0.00   0.00    0.00   0.00
GloVe    3150     92.9   91.6    93.5   92.6    88.4   85.8
  p-val           0.02   0.02    0.04   0.06    0.54   0.38
HPCA     2749     92.1   90.8    92.5   91.7    86.6   84.0
  p-val           0.00   0.00    0.00   0.00    0.00   0.00
RSV      2859     93.1   91.8    93.6   92.8    88.4   85.9

Table 3: Performance of word embedding methods. The quality of the word vectors is measured with respect to the parsing models trained with them. The efficiency of the models is measured with respect to the time (seconds) required to extract a set of word vectors. Parsing models are evaluated on our English development set; SGram: SkipGram, SD: Stanford typed Dependencies, CD: CoNLL Dependencies, UD: Universal Dependencies, and p-val: p-value of the null hypothesis that RSV is no better than the word embedding method corresponding to each cell of the table.
The results obtained from the RSV word vectors are comparable to, and slightly better than, the other sets of word vectors. The difference between RSV and the other methods is clearer when one looks at the differences between the labelled attachment scores. Apart from the parsing experiment with GloVe on the universal dependencies, the relatively small p-values reject our null hypothesis and confirm that RSV can produce better-suited word vectors for the task of dependency parsing. In addition, the consistent superiority of the results obtained from RSV across different dependency styles is evidence that the results are statistically significant, i.e., the advantage of RSV is not due merely to chance. Among the methods of word embedding, we see that the results obtained from GloVe are closest to RSV, especially on the universal dependencies. We show in Sec. 8 how these methods are connected to each other.
Table 4 shows the results obtained from the Stanford parser trained with RSV vectors and two other greedy transition-based dependency parsers, MaltParser (Nivre et al., 2006) and Parsito (Straka et al., 2015). All the parsing models are trained with the arc-standard system on the corpus of universal dependencies. Par-St and Par-Sr refer to the results reported for Parsito trained with the static oracle and the search-based oracle. As shown, in all cases the parsing models trained with the Stanford parser and RSV (St-RSV) are more accurate than the other parsing models. The superiority of the results obtained from St-RSV over Par-Sr shows the importance of word vectors in dependency parsing in comparison with adding more features to the parser or using the search-based oracle.
            Par-St         Par-Sr         Malt           St-RSV
            UAS    LAS     UAS    LAS     UAS    LAS     UAS    LAS
English     86.7   84.2    87.4   84.7    86.3   82.9    87.6   84.9
Persian     83.8   80.2    84.5   81.1    80.8   77.2    85.4   82.4
Swedish     85.3   81.4    85.9   82.3    84.7   80.3    86.2   82.5

Table 4: Accuracy of dependency parsing. Par-St and Par-Sr refer to the Parsito models trained with the static oracle and the search-based oracle. St-RSV refers to the Stanford parser trained with RSV vectors.
The results obtained from the Stanford parser and UDPipe (Straka et al., 2016) are summarized in Table 5. The results are reported for both predicted and gold POS tags. UDPipe sacrifices the greedy nature of Parsito by adding a beam search decoder to it. In general, one can argue that UDPipe adds the following items to the Stanford parser: 1) a search-based oracle, 2) a set of morphological features, and 3) a beam search decoder. The nearly identical results obtained from both parsers for English show that a set of informative word vectors can be as influential as the three extra items added by UDPipe. However, the higher accuracies obtained from UDPipe for Swedish and Persian, together with the fact that the training data for these languages are considerably smaller than for English, show the importance of these extra items for the accuracy of the parsing model when enough training data is not available.
            Predicted tags                Gold tags
            UDPipe         St-RSV         UDPipe         St-RSV
            UAS    LAS     UAS    LAS     UAS    LAS     UAS    LAS
English     84.0   80.2    84.6   80.9    87.5   85.0    87.6   84.9
Swedish     81.2   77.0    80.4   76.6    86.2   83.2    86.2   82.5
Persian     84.1   79.7    82.4   78.1    86.3   83.0    85.4   82.4

Table 5: Accuracy of dependency parsing on the corpus of universal dependencies. St-RSV refers to the Stanford parser trained with RSV vectors.
7 Nature of Dimensions

In this section, we study the nature of the dimensions formed by RSV. Starting from the high-dimensional space formed by the transformed co-occurrence matrix f(C), word similarities can be measured by a similarity matrix K = f(C)^T f(C), whose leading eigenvectors, corresponding to the leading right singular vectors of f(C), form the RSV word vectors. This suggests that the RSV dimensions measure a typical kind of word similarity on the basis of the variability of a word's contexts, since the eigenvectors of K account for the directions of largest variance in the word vectors defined by f(C).

To assess the validity of this statement, we study the dimensions individually. For each dimension, we first project all unit-sized word vectors onto it and then sort the resulting data in ascending order to see if any syntactic or semantic regularities emerge. Table 6 shows the 10 words appearing at the head of the ordered lists for the first 10 dimensions. The dimensions are indexed according to their related singular vectors. The table shows that to some extent the dimensions match syntactic and semantic word categories discussed in linguistics. There is a direct relation between the indices and the variability of a word's contexts. The regularities between words appearing in highly variable contexts, mostly the high-frequency words, are captured by the leading dimensions.
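The per-dimension inspection described above can be sketched as follows (assuming the matrix of RSV word vectors from Eq. 1 and an index-to-word list; the function name is ours):

    import numpy as np

    def words_at_head_of_dimension(word_vectors, id2word, dim, top=10):
        # word_vectors: (k, n) matrix in which each column is one word vector
        unit = word_vectors / np.linalg.norm(word_vectors, axis=0, keepdims=True)
        projections = unit[dim]                      # projections of the unit-sized vectors onto `dim`
        order = np.argsort(projections)              # ascending order
        return [id2word[i] for i in order[:top]]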
To a large extent, the first dimension accounts for the degree of variability of a word's contexts. Lower numbers are given to words that appear in highly flexible contexts (i.e., high-frequency words such as as, but, in, and ...). Dimensions 2-5 are informative for determining syntactic categories such as adjectives, proper nouns, function words, and verbs. Dimensions 2 and 8 give lower numbers to proper names (interestingly, mostly last names in 2 and male first names in 8). Some kind of semantic regularity can also be seen in most dimensions. For example, adjectives in dimension 2 are mostly connected with society, nouns in dimension 6 denote humans, nouns in dimension 7 are abstract, and words in dimension 9 are mostly connected to affective emotions.
8 Related Work on Word Vectors
The two dominant approaches to creating word vectors (or word embeddings) are: 1) incremental methods that update a set of randomly initialized word vectors while scanning a corpus (Mikolov et al., 2013; Collobert et al., 2011), and 2) batch methods that extract a set of word vectors from a co-occurrence matrix (Pennington et al., 2014; Lebret and Collobert, 2014). Pennington et al. (2014) show that both approaches are closely related to each other.
Dim   Top 10 words
1     . – , is as but in ... so
2     domestic religious civilian russian physical social iraqi japanese mexican scientific
3     mitchell reid evans allen lawrence palmer duncan russell bennett owen
4     . but in – the and or as at ,
5     's believes thinks asks wants replied tries says v. agrees
6     citizens politicians officials deputy businessmen lawmakers former elected lawyers politician
7     cooperation policy reforms policies funding reform approval compliance oversight assistance
8     geoff ron doug erik brendan kurt jeremy brad ronnie yuri
9     love feeling sense answer manner desire romantic emotional but ...
10    have were are but – . and may will

Table 6: Top 10 words projected on the top 10 dimensions.
Here, we elaborate on the connections between RSV and HPCA (Lebret and Collobert, 2014) and GloVe (Pennington et al., 2014).
HPCA performs a Hellinger transformation followed by principal component analysis on the co-occurrence matrix C as below:

    Y = S V^T                                              (3)

where Y is the matrix of word vectors, and S and V are the matrices of top singular values and right singular vectors of √C. Since the word vectors are to be used by a neural network, Lebret and Collobert (2014) recommend normalizing them to avoid the saturation problem in the network weights (LeCun et al., 2012). Denoting Ỹ as the empirical mean of the column vectors in Y and σ(Y) as their standard deviation, Eq. 4, suggested by Lebret and Collobert (2014), normalizes the elements of the word vectors to have zero mean and a fixed standard deviation of λ ≤ 1:

    Υ = λ (Y − Ỹ) / σ(Y)                                   (4)

Ỹ is 0 if one centres the column vectors of √C around their mean before performing PCA. Substituting Eq. 3 into Eq. 4, and using the facts that Ỹ = 0 and σ(Y) = (1/√(n−1)) S, where n is the number of words, we reach Eq. 1.
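This equivalence can be checked numerically, as in the following sketch. A random non-negative matrix stands in for C, the comparison is restricted to the top-k components where the singular values are clearly positive, and the exact factor coming out of the substitution is λ√(n−1), which differs negligibly from λ√n for large n:

    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.random((30, 20))                        # stand-in co-occurrence matrix (contexts x words)
    H = np.sqrt(C)                                  # Hellinger transformation
    H = H - H.mean(axis=1, keepdims=True)           # centre the column vectors around their mean
    _, S, Vt = np.linalg.svd(H, full_matrices=False)

    lam, k, n = 0.1, 5, C.shape[1]
    Y = np.diag(S) @ Vt                             # Eq. 3 with Y_tilde = 0
    upsilon_hpca = lam * Y[:k] / Y[:k].std(axis=1, ddof=1, keepdims=True)   # Eq. 4, applied row-wise
    upsilon_rsv = lam * np.sqrt(n - 1) * Vt[:k]                             # Eq. 1 with gamma = lam*sqrt(n-1)
    assert np.allclose(upsilon_hpca, upsilon_rsv)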
In general, one can argue that RSV generalises the Hellinger transformation used in HPCA to a set of more general transformation functions. Other differences between RSV and HPCA are 1) how they form the co-occurrence matrix C, and 2) when they centre the data. For each word w_i and each context word w_j, C_{i,j} in RSV is p(w_j | w_i), but p(w_i | w_j) in HPCA. In RSV, the column vectors of f(C) are centred around their mean before performing SVD, but in HPCA, the data are centred after performing PCA. In Sec. 6, we showed that these changes result in a significant improvement in the quality of the word vectors.
The connection between RSV and GloVe is as follows. GloVe extracts word vectors from a co-occurrence matrix transformed by the logarithm function. Using a global regression model, Pennington et al. (2014) argue that linear directions of meaning are captured by the matrix of word vectors Υ_{n,k} with the following property:

    Υ^T Υ = log(C) + b 1                                   (5)

where C_{n,n} is the co-occurrence matrix, b_{n,1} is a bias vector, and 1_{1,n} is a vector of ones. Denoting Υ_i as the ith column of Υ and assuming ‖Υ_i‖ = 1 for i = 1 ... n, the left-hand side of Eq. 5 measures the cosine similarity between unit-sized word vectors Υ_i in a kernel space, and the right-hand side is the corresponding kernel matrix. Using kernel principal component analysis (Schölkopf et al., 1998), a k-dimensional estimation of Υ in Eq. 5 is

    Υ = S V^T                                              (6)

where S and V are the matrices of top singular values and singular vectors of K. Replacing the kernel matrix in Eq. 5 with the second-degree polynomial kernel K = f(C)^T f(C), which measures the similarities on the basis of the column vectors defined by the co-occurrence matrix, the word vectors generated by Eq. 6 and Eq. 1 are distributed in the same directions but with different variances. This shows that the main difference between RSV and GloVe lies in the kernel matrices they use.
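The claim that the leading eigenvectors of the kernel K = f(C)^T f(C) coincide with the top right singular vectors of f(C), and hence with the directions of the RSV vectors, can be verified numerically, as in the following sketch (a random matrix stands in for f(C)):

    import numpy as np

    rng = np.random.default_rng(1)
    F = rng.normal(size=(30, 20))                   # stand-in for the transformed matrix f(C)
    K = F.T @ F                                     # second-degree polynomial kernel over the columns
    eigvals, eigvecs = np.linalg.eigh(K)            # eigenvalues in ascending order
    leading = eigvecs[:, ::-1][:, :5].T             # leading eigenvectors of K, as rows
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    # identical up to sign, so the two sets of vectors span the same directions
    assert np.allclose(np.abs(leading), np.abs(Vt[:5]))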
9 Conclusion

In this paper, we have proposed to form a set of word vectors from the right singular vectors of a co-occurrence matrix that is transformed by a 7th-root transformation function. It has been shown that the proposed method is closely related to previous methods of word embedding such as HPCA and GloVe. Our experiments on the task of dependency parsing show that the parsing models trained with our word vectors are more accurate than the parsing models trained with other popular methods of word embedding.
References

Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, and Farhad Oroumchian. 2009. Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5):382–387.

Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 995–1005, Stroudsburg, PA, USA. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Marie-Catherine De Marneffe and Christopher D. Manning. 2010. Stanford typed dependencies manual (2008). URL: http://nlp.stanford.edu/software/dependencies_manual.pdf.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D. Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In 16th Nordic Conference of Computational Linguistics, pages 105–112. University of Tartu.

Rémi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, Gothenburg, Sweden, April. Association for Computational Linguistics.

Rémi Lebret and Ronan Collobert. 2015. Rehabilitation of count-based models for word vector representations. In Computational Linguistics and Intelligent Text Processing, pages 417–429. Springer.

Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics - Special issue on using large corpora, 19(2):313–330, June.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319.

Mojgan Seraji. 2015. Morphosyntactic Corpora and Tools for Persian. Ph.D. thesis, Uppsala University.

Milan Straka, Jan Hajič, Jana Straková, and Jan Hajič jr. 2015. Parsing universal dependency treebanks using neural networks and search-based oracle. In International Workshop on Treebanks and Linguistic Theories (TLT14), pages 208–220.

Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

A. Tropp, N. Halko, and P. G. Martinsson. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical report, Applied & Computational Mathematics, California Institute of Technology.
... Our word vectors are generated via RSV (Real-valued Syntactic Word Vectors) model for word embedding (Basirat and Nivre, 2017). RSV extracts a set of word vectors from an unlabeled data in three major steps: First, it builds a co-occurrence matrix whose elements are the frequency of seeing words together. ...
... Within our experiments we set context type and context size as the immediate preceding word, as proposed by (Basirat and Nivre, 2017). The number of dimensions is set to 50. ...
... Through the application of word embedding (Basirat and Nivre, 2017) with various classifiers such as linear discriminant analysis (LDA) and feed-forward neural network (NN), we are able to demonstrate that some types of nominal features can be captured by word embedding models. We show that both grammatical and semantic properties of nouns may be identified correctly through word embedding. ...
Book
We study the presence of linguistically motivated information in the word embeddings generated with statistical methods. The nominal aspects of uter/neuter, common/proper, and count/mass in Swedish are selected to represent respectively grammatical, semantic, and mixed types of nominal categories within languages. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. The classification of semantic features required significantly less neurons than grammatical features in our experiments based on a single layer feed-forward neural network. However, semantic features also generated higher entropy in the classification output despite its high accuracy. Furthermore, the count/mass distinction resulted in difficulties to the model, even though the quantity of neurons was almost tuned to its maximum.
... Real-valued Syntactic word Vectors (RSV) (Basirat & Nivre, 2017) is a method of word embedding that builds a set of word vectors from right singular vectors of a transformed co-occurrence matrix. RSV uses an nth root transformation function to reshape the distribution of the co-occurrence data. ...
... The word embedding methods can be divided into two major classes: 1) the methods that are developed in the area of distributional semantics (Basirat & Nivre, 2017;Landauer & Dumais, 1997;Lebret & Collobert, 2014;Lund & Burgess, 1996;Pennington et al., 2014;Sahlgren, 2006;Schütze, 1992) and 2) the methods that are developed in the area of language modeling . Levy and Goldberg (2014b) show that both classes are connected to each other. ...
... Levy and Goldberg (2014b) show that both classes are connected to each other. In both areas, a set of word vectors is generated through the application of a dimensionality reduction technique on a co-occurrence matrix, which is built either explicitly (Basirat & Nivre, 2017;Lebret & Collobert, 2014;Pennington et al., 2014) or implicitly Sahlgren, 2006) while scanning a raw corpus. In this section, we review the previous work in the area of distributional semantics since RSV is categorized in this area. ...
Article
Full-text available
We introduce a word embedding method that generates a set of real-valued word vectors from a distributional semantic space. The semantic space is built with a set of context units (words) which are selected by an entropy-based feature selection approach with respect to the certainty involved in their contextual environments. We show that the most predictive context of a target word is its preceding word. An adaptive transformation function is also introduced that reshapes the data distribution to make it suitable for dimensionality reduction techniques. The final low-dimensional word vectors are formed by the singular vectors of a matrix of transformed data. We show that the resulting word vectors are as good as other sets of word vectors generated with popular word embedding methods.
... It encodes syntactic and semantic similarities between the target word and the other existing words in the lexicon (Erk, 2012). In our study, such vector representation is generated via the RSV (Real-valued Syntactic Word Vectors) model for word embedding (Basirat and Nivre, 2017) and fed to the feed-forward neural network, which is a basic architecture for classification task (Haykin, 1998). RSV is a an automatic method of word embedding based on the structure of GloVe (Pennington et al., 2014). ...
Preprint
Full-text available
We analyze the information provided by the word embeddings about the grammatical gender in Swedish. We wish that this paper may serve as one of the bridges to connect the methods of computational linguistics and general linguistics. Taking nominal classification in Swedish as a case study, we first show how the information about grammatical gender in language can be captured by word embedding models and artificial neural networks. Then, we match our results with previous linguistic hypotheses on assignment and usage of grammatical gender in Swedish and analyze the errors made by the computational model from a linguistic perspective.
... In the output layer, the network generates two weight values corresponding to the grammatical gender of the input word. The input word vectors are extracted by the RSV (Real-valued Syntactic Word Vectors) word embedding model (Basirat & Nivre, 2017). Our model relies on two main sources of data, which both originate from the Swedish Language Bank (Språkbanken) located at the University of Gothenburg: a corpus of Swedish raw sentences and a list of nouns affiliated to grammatical genders. ...
... First (embedding), a corpus (raw sentences with segmented words) is fed to the word embedding model, which assigns a vector to each word according to its contexts of occurrence, i.e., which words are preceding and following. In our study, such vector representation is generated via the RSV (Real-valued Syntactic Word Vectors) model for word embeddings (Basirat and Nivre, 2017), which is an automatic method of word embedding based on the structure of GloVe (Pennington et al., 2014). In the second step (labeling), the list of word vectors associated with the nouns are labeled with their grammatical gender based on the dictionary. ...
Book
Categorization is one the most relevant tasks realized by humans during their life, as we consistently need to categorize the things and experience that we encounter. Such need is reflected in language via various mechanisms, the most prominent being nominal classification systems (e.g., grammatical gender such as the masculine/feminine distinction in French). Typological methods are used to investigate the underlying functions and structures of such systems, using a wide variety of cross-linguistic data to examine universality and variability. This analysis is itself a classification task, as languages are categorized and clustered according to their grammatical features. This thesis provides a cross-linguistic typological analysis of nominal classification systems and in parallel compares a number of quantitative methods that can be applied at different scales. First, this thesis provides an analysis of nominal classification systems (i.e., gender and classifiers) via the description of three languages with respectively gender, classifiers, and both. While the analysis of the first two languages are more of a descriptive nature and aligns with findings in the existing literature, the third language provides novel insights to the typology of nominal classification systems by demonstrating how classifiers and gender may co-occur in one language in terms of distribution of functions. Second, the underlying logic of nominal classification systems is commonly considered difficult to investigate, e.g., is there a consistent logic behind gender assignment in language? is it possible to explain the distribution of classifier languages of the world while taking into account geographical and genealogical effects? This thesis addresses the lack of arbitrariness of nominal classification systems at three different scales: The distribution of classifiers at the worldwide level, the presence of gender within a language family, and gender assignment at the language-internal level. The methods of random forests, phylogenetics, and word embeddings with neural networks are selected since they are respectively applicable at three different scales of research questions (worldwide, family-internal, language-internal).
... See Section 2 of de Lhoneux et al. (2017a) for more details. We constructed word embeddings based on the RSV model of Basirat and Nivre (2017), using universal part-of-speech tags as contexts, see Section 4 of de Lhoneux et al. (2017a) for more details. Our word vectors are thus constructed as such: each word x i is represented by a randomly initialised embedding of the word e(w i ), a pre-trained embedding pe(w i ) and a character vector ce(w i ): ...
Thesis
Full-text available
This thesis presents several studies in neural dependency parsing for typologically diverse languages, using treebanks from Universal Dependencies (UD). The focus is on informing models with linguistic knowledge. We first extend a parser to work well on typologically diverse languages, including morphologically complex languages and languages whose treebanks have a high ratio of non-projective sentences, a notorious difficulty in dependency parsing. We propose a general methodology where we sample a representative subset of UD treebanks for parser development and evaluation. Our parser uses recurrent neural networks which construct information sequentially, and we study the incorporation of a recursive neural network layer in our parser. This follows the intuition that language is hierarchical. This layer turns out to be superfluous in our parser and we study its interaction with other parts of the network. We subsequently study transitivity and agreement information learned by our parser for auxiliary verb constructions (AVCs). We suggest that a parser should learn similar information about AVCs as it learns for finite main verbs. This is motivated by work in theoretical dependency grammar. Our parser learns different information about these two if we do not augment it with a recursive layer, but similar information if we do, indicating that there may be benefits from using that layer and we may not yet have found the best way to incorporate it in our parser. We finally investigate polyglot parsing. Training one model for multiple related languages leads to substantial improvements in parsing accuracy over a monolingual baseline. We also study different parameter sharing strategies for related and unrelated languages. Sharing parameters that partially abstract away from word order appears to be beneficial in both cases but sharing parameters that represent words and characters is more beneficial for related than unrelated languages.
... The y-axis indicates the total ratio. The x-axis represents the nouns of the corpus partitioned into ten groups by their descending frequency Word vectors are generated by Real-valued Syntactic Word Vectors (RSV) (Basirat and Nivre, 2017) and fed to a feed-forward neural network, which is used as the classifier. The parameters of the model are set as window size one with asymmetric-backward window type. ...
Conference Paper
Full-text available
We study the presence of information provided by word embeddings from real-valued syntactic word vectors for determining the grammatical gender of nouns in Swedish. Our investigation reveals that regardless of being a frequently used word or not, real-valued syntactic word vectors are highly informative for identifying the grammatical gender of nouns. By using a neural network classifier we show that the uncertainty involved in the output of the network is only weakly correlated with the frequency level of words. Moreover, a linguistic analysis of errors demonstrates that while half of the errors can be avoided by using POS tag of words, the remaining errors are linguistically motivated and require extra information about the context of words.
... The different word embedding methods proposed in the literature can be divided into two main categories: 1) methods developed in the area of distributional semantics (Schütze, 1992; Lund and Burgess, 1996; Landauer and Dumais, 1997; Sahlgren, 2006; Pennington et al., 2014; Lebret and Collobert, 2014; Basirat and Nivre, 2017), and 2) methods developed in the area of language modelling (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013a). Levy and Goldberg (2014) show that these methods are highly connected to each other. ...
Conference Paper
Word embeddings are fundamental objects in neural natural language processing approaches. Although word embedding methods follow the same basic principles, in practice most methods that use PCA are not as successful as methods developed in the area of language modelling, which use neural networks to train word embeddings. In this paper, we address the limiting factors of PCA for word embedding and propose solutions to mitigate those factors. Our experimental results show that principal word embeddings generated with our approach are better than or as good as other sets of word embeddings when used in different NLP tasks.
Article
Text-related soft information effectively alleviates the information asymmetry associated with P2P lending and reduces credit risk. Most existing studies use non-semantic text information to construct credit evaluation models and predict the borrower's level of risk. However, semantic information also reflects the ability and willingness of borrowers to repay and might be able to explain borrowers' credit statuses. This paper examines whether semantic loan description text information helps predict the credit risk of different types of borrowers using a Chinese P2P platform. We use the 5P credit evaluation theory and the word embedding model to extract the semantic features of loan descriptions across five dimensions. Then, the AdaBoost ensemble learning strategy is applied to construct a credit evaluation model and improve the learning performance of the intelligent algorithm. The extracted semantic features are integrated into the evaluation model to study their explanatory ability with regard to the credit status of different types of borrowers. We conducted empirical research on the Renrendai P2P platform. Our conclusions show that the semantic features of textual soft information significantly improve the predictability of credit evaluation models and that the promotion effect is most significant for first-time borrowers. This paper has important practical significance for P2P platforms and the credit risk management of lenders. Furthermore, it has theoretical value for research concerning heterogeneous information-based credit risk analysis methods in big data environments.
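As a rough illustration of the modelling strategy summarised above (embedding-derived features fed to an AdaBoost ensemble), the following Python sketch uses scikit-learn on synthetic placeholder data; nothing here reproduces the authors' 5P feature extraction or their actual dataset:

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Placeholder feature matrix: rows = loans, columns = embedding-derived
# semantic features plus conventional hard-information features (synthetic).
rng = np.random.default_rng(0)
X = rng.random((500, 20))
y = rng.integers(0, 2, size=500)   # 0 = repaid, 1 = default

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = AdaBoostClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
print("held-out accuracy:", model.score(X_te, y_te))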
Chapter
We study the presence of linguistically motivated information in word embeddings generated with statistical methods. The nominal distinctions of uter/neuter, common/proper, and count/mass in Swedish are selected to represent grammatical, semantic, and mixed types of nominal categories, respectively. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. In our experiments with a single-layer feed-forward neural network, classifying the semantic features required significantly fewer neurons than classifying the grammatical features. However, the semantic features also produced higher entropy in the classification output despite the high accuracy. Furthermore, the count/mass distinction posed difficulties for the model, even though the number of neurons was tuned almost to its maximum.
Book
The book is available electronically at the following address: http://uu.diva-portal.org/smash/record.jsf?pid=diva2%3A800998&dswid=-1537
Conference Paper
Recent works on word representations mostly rely on predictive models. Distributed word representations (aka word embeddings) are trained to optimally predict the contexts in which the corresponding words tend to appear. Such models have succeeded in capturing word similarities as well as semantic and syntactic regularities. Instead, we aim at reviving interest in a model based on counts. We present a systematic study of the use of the Hellinger distance to extract semantic representations from the word co-occurrence statistics of large text corpora. We show that this distance gives good performance on word similarity and analogy tasks, with a proper type and size of context, and a dimensionality reduction based on a stochastic low-rank approximation. Besides being both simple and intuitive, this method also provides an encoding function which can be used to infer unseen words or phrases. This is a clear advantage over predictive models, which must be trained on these new words.
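The count-based recipe summarised in this abstract can be approximated in a few lines: build a word-context co-occurrence probability matrix, take element-wise square roots (so that Euclidean distance between rows approximates the Hellinger distance), and reduce dimensionality with a low-rank factorisation. The Python sketch below uses a toy matrix and a plain SVD as a simplified stand-in for the authors' stochastic low-rank approximation:

import numpy as np
from numpy.linalg import svd

# Toy word-by-context co-occurrence counts (rows: words, columns: contexts).
counts = np.array([[4., 1., 0.],
                   [3., 2., 1.],
                   [0., 1., 5.]])

# Row-normalise to conditional probabilities P(context | word).
probs = counts / counts.sum(axis=1, keepdims=True)

# Hellinger transformation: element-wise square root of the probabilities.
hellinger = np.sqrt(probs)

# Low-rank approximation via SVD; keep the top-k dimensions as embeddings.
U, S, Vt = svd(hellinger, full_matrices=False)
k = 2
embeddings = U[:, :k] * S[:k]   # one k-dimensional vector per word
print(embeddings)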
Article
We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including: part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.
Conference Paper
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
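Models of the kind described in this abstract are available, for example, in the gensim library; the snippet below is a generic usage sketch on a toy corpus using the gensim 4.x API (the parameter values are arbitrary), not the original implementation evaluated in the paper:

from gensim.models import Word2Vec

sentences = [["the", "dog", "barks"], ["the", "cat", "meows"],
             ["a", "dog", "chases", "a", "cat"]]

# Skip-gram model (sg=1); vector_size, window and min_count are illustrative.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
print(model.wv.most_similar("dog", topn=3))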
Conference Paper
Automatic natural language processing of large texts often presents recurring challenges in multiple languages: even for the most advanced tasks, the texts are first processed by basic processing steps – from tokenization to parsing. We present an extremely simple-to-use tool consisting of one binary and one model (per language), which performs these tasks for multiple languages without the need for any other external data. UDPipe, a pipeline processing CoNLL-U-formatted files, performs tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing for nearly all treebanks of Universal Dependencies 1.2 (namely, the whole pipeline is currently available for 32 out of 37 treebanks). In addition, the pipeline is easily trainable with training data in CoNLL-U format (and in some cases also with additional raw corpora) and requires minimal linguistic knowledge on the users' part. The training code is also released.
Conference Paper
We describe a transition-based, non-projective dependency parser which uses a neural network classifier for prediction and requires no feature engineering. We propose a new, search-based oracle, which improves parsing accuracy similarly to a dynamic oracle, but is applicable to any transition system, such as the fully non-projective swap system. The parser has excellent parsing speed, compact models, and achieves high accuracy without requiring any additional resources such as raw corpora. We tested it on all 19 treebanks of the Universal Dependencies project. The C++ implementation of the parser is being released as an open-source tool.
Conference Paper
We investigate two aspects of the empirical behavior of paired significance tests for NLP systems. First, when one system appears to outperform another, how does significance level relate in practice to the magnitude of the gain, to the size of the test set, to the similarity of the systems, and so on? Is it true that for each task there is a gain which roughly implies significance? We explore these issues across a range of NLP tasks using both large collections of past systems' outputs and variants of single systems. Next, once significance levels are computed, how well does the standard i.i.d. notion of significance hold up in practical settings where future distributions are neither independent nor identically distributed, such as across domains? We explore this question using a range of test set variations for constituency parsing.
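The paper above studies paired significance tests empirically; as a concrete illustration of the family of tests discussed, the Python sketch below implements a simplified paired bootstrap resampling test over per-sentence scores. It is a generic variant on synthetic data, not the exact procedure analysed in the paper:

import random

def paired_bootstrap(scores_a, scores_b, samples=10000, seed=0):
    """Estimate how often system B fails to beat system A when the test
    set is resampled with replacement (a simple paired bootstrap test)."""
    rng = random.Random(seed)
    n = len(scores_a)
    worse = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_b[i] - scores_a[i] for i in idx)
        if delta <= 0:
            worse += 1
    return worse / samples

# Synthetic per-sentence scores (e.g. attachment scores) for two parsers.
a = [0.80, 0.75, 0.90, 0.60, 0.85] * 20
b = [0.82, 0.78, 0.90, 0.65, 0.84] * 20
print("p-value estimate:", paired_bootstrap(a, b))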