All content in this area was uploaded by Ali Basirat on Jun 15, 2017
Real-valued Syntactic Word Vectors (RSV) for Greedy Neural
Dependency Parsing
Ali Basirat and Joakim Nivre
Department of Linguistics and Philology
Uppsala University
{ali.basirat,joakim.nivre}@lingfil.uu.se
Abstract
We show that a set of real-valued word vectors formed by right singular
vectors of a transformed co-occurrence matrix is meaningful for
determining different types of dependency relations between words. Our
experimental results on the task of dependency parsing confirm the
superiority of these word vectors over other sets of word vectors
generated by popular methods of word embedding. We also study the effect
of using these vectors on the accuracy of dependency parsing in
different languages versus using more complex parsing architectures.
1 Introduction
Greedy transition-based dependency parsing is appealing thanks to its
efficiency, deriving a parse tree for a sentence in linear time using a
discriminative classifier. Among the different methods of classification
used in a greedy dependency parser, neural network models capable of
using real-valued vector representations of words, called word vectors,
have shown significant improvements in both accuracy and speed of
parsing. Chen and Manning (2014) first proposed using word vectors in a
3-layered feed-forward neural network as the core classifier in a
transition-based dependency parser. The classifier is trained by the
standard back-propagation algorithm. Using a limited number of features
defined over a certain number of elements in a parser configuration,
they built an efficient and accurate parser, called the Stanford neural
dependency parser. This architecture was then extended by Straka et al.
(2015) and Straka et al. (2016). Parsito (Straka et al., 2015) adds a
search-based oracle and a set of morphological features to the original
architecture in order to make it capable of parsing the corpus of
universal dependencies. UDPipe (Straka et al., 2016) adds beam search
decoding to Parsito in order to improve the parsing accuracy at the cost
of decreasing the parsing speed.
We propose to improve the parsing accuracy in the architecture
introduced by Chen and Manning (2014) by using more informative word
vectors. The idea is based on the greedy nature of the back-propagation
algorithm, which makes it highly sensitive to the initial state of the
algorithm. Thus, higher-quality word vectors are expected to positively
affect the parsing accuracy. The word vectors in our approach are formed
by right singular vectors of a matrix returned by a transformation
function that takes a probability co-occurrence matrix as input and
expands the data massed around zero. We show how the proposed method is
related to HPCA (Lebret and Collobert, 2014) and GloVe (Pennington et
al., 2014).
Using these word vectors with the Stanford parser, we obtain a parsing
accuracy of 93.0% UAS and 91.7% LAS on Wall Street Journal (Marcus et
al., 1993). The word vectors consistently improve the parsing models
trained with different types of dependencies in different languages. Our
experimental results show that parsing models trained with the Stanford
parser can be as accurate as, or in some cases more accurate than, other
parsers such as Parsito and UDPipe.
2 Transition-Based Dependency Parsing
A greedy transition-based dependency parser derives a parse tree from a
sentence by predicting a sequence of transitions between a set of
configurations characterized by a triple $c = (\Sigma, B, A)$, where
$\Sigma$ is a stack that stores partially processed nodes, $B$ is a
buffer that stores unprocessed nodes in the input sentence, and $A$ is a
set of partial parse trees assigned to the processed nodes. Nodes are
positive integers corresponding to linear positions of words in the
input sentence. The process of parsing starts from an initial
configuration and ends with some terminal configuration. The transitions
between configurations are controlled by a classifier trained on a
history-based feature model which combines features of the partially
built dependency tree and attributes of input tokens.
The arc-standard algorithm (Nivre, 2004) is among the many different
algorithms proposed for moving between configurations. The algorithm
starts with the initial configuration in which all words are in $B$,
$\Sigma$ is empty, and $A$ holds an artificial node 0. It uses three
actions, Shift, Right-Arc, and Left-Arc, to transition between the
configurations and build the parse tree. Shift unconditionally pushes
the first node in the buffer onto the stack. The two actions Left-Arc
and Right-Arc are used to build left and right dependencies,
respectively, and are restricted by the fact that the final dependency
tree has to be rooted at node 0.
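
The three arc-standard actions can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names are hypothetical, arcs are unlabelled, and it assumes that the artificial node 0 is placed at the front of the buffer so that it is shifted first, which is one common way to enforce a tree rooted at node 0.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Parser state c = (stack, buffer, arcs); nodes are word positions."""
    stack: list
    buffer: list
    arcs: set = field(default_factory=set)  # set of (head, dependent) pairs


def initial_configuration(n_words):
    # Assumption: the artificial root 0 sits at the front of the buffer.
    return Configuration(stack=[], buffer=list(range(n_words + 1)))


def shift(c):
    # Move the first buffer node onto the stack.
    c.stack.append(c.buffer.pop(0))


def left_arc(c):
    # The top of the stack becomes the head of the second-topmost node.
    head, dep = c.stack[-1], c.stack[-2]
    if dep == 0:
        raise ValueError("node 0 must remain the root")
    c.arcs.add((head, dep))
    del c.stack[-2]


def right_arc(c):
    # The second-topmost node becomes the head of the top of the stack.
    dep = c.stack.pop()
    c.arcs.add((c.stack[-1], dep))


def is_terminal(c):
    return not c.buffer and c.stack == [0]


def run(n_words, transitions):
    c = initial_configuration(n_words)
    for transition in transitions:
        transition(c)
    return c


# "She eats fish": eats (2) heads She (1) and fish (3); node 0 heads eats.
final = run(3, [shift, shift, shift, left_arc, shift, right_arc, right_arc])
```

In a real parser the sequence of transitions is not given but predicted one step at a time by the classifier described in the next section.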
3 Stanford Dependency Parser
The Stanford dependency parser can be considered a turning point in the
history of greedy transition-based dependency parsing. The parser
significantly improved both the accuracy and speed of dependency
parsing. The key to the parser's success can be summarized in two
points: 1) accuracy is improved by using a neural network with
pre-trained word vectors, and 2) efficiency is improved by
pre-computation to keep the most frequent computations in memory.
The parser is an arc-standard system with a feed-forward neural network
as its classifier. The neural network consists of three layers. An input
layer connects the network to a configuration through three real-valued
vectors representing words, POS tags, and dependency relations. The
vectors that represent POS tags and dependency relations are initialized
randomly, but those that represent words are initialized by word vectors
systematically extracted from a corpus. Each of these vectors is
independently connected to the hidden layer of the network through its
own weight matrix, giving three distinct weight matrices. A cube
activation function is used in the hidden layer to model the
interactions between the elements of the vectors. The activation
function resembles a third-degree polynomial kernel that enables the
network to take different combinations of vector elements into
consideration. The output layer generates probabilities for decisions
between different actions in the arc-standard system. The network is
trained by the standard back-propagation algorithm, which updates both
the network weights and the vectors used in the input layer.
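
The forward pass described above can be sketched as follows. This is only an illustration under stated assumptions: the layer sizes, the number of features per input vector, the random initialization, and all names are hypothetical, and the output is restricted to three unlabelled actions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values taken from this paper).
n_word_feats, n_pos_feats, n_label_feats = 18, 18, 12
emb_dim, hidden_units, n_actions = 50, 200, 3


def forward(x_word, x_pos, x_label, params):
    """Return action probabilities for one parser configuration."""
    W_word, W_pos, W_label, b, W_out = params
    # Each input vector has its own weight matrix into the hidden layer.
    pre = W_word @ x_word + W_pos @ x_pos + W_label @ x_label + b
    hidden = pre ** 3                   # cube activation
    scores = W_out @ hidden
    scores = scores - scores.max()      # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()              # softmax over the actions


params = (
    rng.normal(0, 0.01, (hidden_units, n_word_feats * emb_dim)),
    rng.normal(0, 0.01, (hidden_units, n_pos_feats * emb_dim)),
    rng.normal(0, 0.01, (hidden_units, n_label_feats * emb_dim)),
    np.zeros(hidden_units),
    rng.normal(0, 0.01, (n_actions, hidden_units)),
)

# Concatenated embeddings of the selected words, POS tags, and labels.
x_word = rng.normal(size=n_word_feats * emb_dim)
x_pos = rng.normal(size=n_pos_feats * emb_dim)
x_label = rng.normal(size=n_label_feats * emb_dim)

probs = forward(x_word, x_pos, x_label, params)
```

Expanding the cube of the pre-activation yields products of up to three input elements, which is the sense in which the activation behaves like a third-degree polynomial kernel.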
4 Word Vectors
Dense vector representations of words, in this pa-
per known as word vectors, have shown great im-
provements in natural language processing tasks.
An advantage of this representation compared to
the traditional one-hot representation is that the
word vectors are enriched with information about
the distribution of words in different contexts.
Following Lebret and Collobert (2014), we propose to extract word
vectors from a co-occurrence matrix as follows: First we build a
co-occurrence matrix $C$ from a text. The element $C_{i,j}$ is a maximum
likelihood estimate of the probability of seeing word $w_j$ in the
context of word $w_i$, i.e., $C_{i,j} = p(w_j \mid w_i)$. This results
in a sparse matrix whose data are massed around zero because of the
disproportional contribution of the high-frequency words in estimating
the co-occurrence probabilities. Each column of $C$ can be seen as a
vector in a high-dimensional space whose dimensions correspond to the
context words. In practice, we need to reduce the dimensionality of
these vectors. This can be done by standard methods of dimensionality
reduction such as principal component analysis. However, the high
density of data around zero and the presence of a small number of data
points far from the data mass can lead to some meaningless discrepancies
between word vectors.
In order to have a better representation of the data we expand the
probability values in $C$ by skewing the data mass from zero toward one.
This can be done by any monotonically increasing concave function that
magnifies small numbers in its domain while preserving the given order.
Commonly used transformation functions with these characteristics are
the logarithm function, the hyperbolic tangent, the power
transformation, and the Box-Cox transformation with some specific
parameters. Fig. 1 shows how a transformation function expands
high-dimensional word vectors massed around zero.

Figure 1: PCA visualization of the high-dimensional column-vectors of a
co-occurrence matrix (a) before and (b) after applying the
transformation function $\sqrt[10]{x}$.

After applying $f$ to $C$, we centre the column vectors in $f(C)$ around
their mean and build the word vectors as below:

$$\Upsilon = \gamma V^T_{n,k} \qquad (1)$$

where $\Upsilon$ is a matrix of $k$-dimensional word vectors associated
with $n$ words, $V^T_{n,k}$ contains the top $k$ right singular vectors
of $f(C)$, and $\gamma = \lambda\sqrt{n}$ is a constant factor to scale
the unbounded data in the word vectors. In the following, we will refer
to our model as RSV, standing for Right Singular word Vector or
Real-valued Syntactic word Vectors, as mentioned in the title of this
paper.
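
Eq. 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `rsv` and the toy matrix are hypothetical, a dense exact SVD replaces whatever decomposition was actually used, and centring is read here as subtracting the mean column vector. The defaults (7th root, $\lambda = 0.1$) are the settings reported in Sec. 6.

```python
import numpy as np


def rsv(C, k, lam=0.1, root=7):
    """Return a k x n matrix whose columns are k-dimensional word vectors.

    C    : m x n matrix of co-occurrence probabilities in [0, 1]
    root : degree of the root used as the transformation function f
    """
    n = C.shape[1]
    F = C ** (1.0 / root)                    # f(C), applied element-wise
    F = F - F.mean(axis=1, keepdims=True)    # centre the column vectors
    # Thin SVD: the rows of Vt are the right singular vectors of F.
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    gamma = lam * np.sqrt(n)
    return gamma * Vt[:k]


rng = np.random.default_rng(0)
C = rng.random((20, 8)) ** 4                 # toy data massed near zero
C /= C.sum(axis=0, keepdims=True)            # toy normalization only
vectors = rsv(C, k=3)
```

Because the rows of `Vt` are orthonormal, each of the `k` dimensions of the result has squared norm $\gamma^2 = \lambda^2 n$, so $\lambda$ directly controls the spread of the word vectors.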
5 Experimental Setting
We restrict our experiments to three languages,
English, Swedish, and Persian. Our experiments
on English are organized as follows: Using dif-
ferent types of transformation functions, we first
extract a set of word vectors that gives the best
parsing accuracy on our development set. Then we
study the effect of dimensionality on parsing per-
formance. Finally, we give a comparison between
our best results and the results obtained from other
sets of word vectors generated with popular meth-
ods of word embedding. Using the best transfor-
mation function obtained for English, we extract
word vectors for Swedish and Persian. These word
vectors are then used to train parsing models on
the corpus of universal dependencies.
The English word vectors are extracted from a corpus consisting of raw
sentences in Wall Street Journal (WSJ) (Marcus et al., 1993), English
Wikicorpus,¹ Thomson Reuters Text Research Collection (TRC2), English
Wikipedia corpus,² and the Linguistic Data Consortium (LDC) corpus. We
concatenate all the corpora and split the sentences with the OpenNLP
sentence splitting tool. The Stanford tokenizer is used for
tokenization. Word vectors for Persian are extracted from the Hamshahri
Corpus (AleAhmad et al., 2009), Tehran Monolingual Corpus,³ and Farsi
Wikipedia downloaded from Wikipedia Monolingual Corpora.⁴ The Persian
text normalizer tool (Seraji, 2015) is used for sentence splitting and
tokenization.⁵ Word vectors for Swedish are extracted from Swedish
Wikipedia available at Wikipedia Monolingual Corpora, Swedish web news
corpora (2001-2013), and the Swedish Wikipedia corpus collected by
Språkbanken.⁶ The OpenNLP sentence splitter and tokenizer are used for
normalizing the corpora.

¹ http://www.cs.upc.edu/~nlp/wikicorpus/
We replace all numbers with a special token NUMBER and convert uppercase
letters to lowercase forms in English and Swedish. Word vectors are
extracted only for the unique words appearing at least 100 times. We
choose the cut-off word frequency of 100 because it is commonly used as
a standard threshold in other references. The 10 000 most frequent words
are used as context words in the co-occurrence matrix. Table 1 presents
some statistics of the corpora.

           #Tokens    #W≥1          #W≥100     #Sents
English    8×10^9     14 462 600    404 427    4×10^8
Persian    4×10^8      1 926 233     60 718    1×10^7
Swedish    6×10^8      5 437 176    174 538    5×10^7

Table 1: Size of the corpora from which word vectors are extracted;
#Tokens: total number of tokens; #W≥k: number of unique words appearing
at least k times in the corpora; #Sents: number of sentences.

The word vectors are evaluated with respect to the accuracy of parsing
models trained with them using the Stanford neural dependency parser
(Chen and Manning, 2014). The English parsing models are trained and
evaluated on the corpus of universal dependencies (Nivre et al., 2016)
version 1.2 (UD) and Wall Street Journal (WSJ) (Marcus et al., 1993)
annotated with Stanford typed dependencies (SD) (De Marneffe and
Manning, 2010) and CoNLL syntactic dependencies (CD) (Johansson and
Nugues, 2007). We split WSJ as follows: sections 02–21 for training,
section 22 for development, and section 23 as test set. The Stanford
conversion tool (De Marneffe et al., 2006) and the LTH conversion tool⁷
are used for converting constituency trees in WSJ to SD and CD. The
Swedish and Persian parsing models are trained on the corpus of
universal dependencies. All the parsing models are trained with gold POS
tags unless we clearly mention that predicted POS tags are used.

² https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml
³ http://ece.ut.ac.ir/system/files/NLP/Resources/
⁴ http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora
⁵ http://stp.lingfil.uu.se/~mojgan
⁶ https://spraakbanken.gu.se/eng/resources/corpus
6 Results
In the following we study how word vectors generated by RSV influence
parsing performance. RSV has four main tuning parameters: 1) the context
window, 2) the transformation function $f$, 3) the parameter $\lambda$
used in the normalization step, and 4) the number of dimensions. The
context window can be symmetric or asymmetric with different lengths. We
choose the asymmetric context window with length 1, i.e., the first
preceding word, as suggested by Lebret and Collobert (2015) for
syntactic tasks. $\lambda$ is a task-dependent parameter that controls
the variance of word vectors. In order to find the best value of
$\lambda$, we train the parser with different sets of word vectors
generated randomly by Gaussian distributions with zero mean and
isotropic covariance matrices $\lambda I$ with values of
$\lambda \in (0, 1]$. Fig. 2a shows the parsing accuracies obtained from
these word vectors. The best results are obtained from word vectors
generated by $\lambda = 0.1$ and $\lambda = 0.01$. The variation in the
results shows the importance of the variance of word vectors for the
accuracy of the parser. Based on this observation, we set the
normalization parameter $\lambda$ in Eq. 1 equal to 0.1. The two
remaining parameters are explored in the following subsections.
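
The random baseline vectors used to tune $\lambda$ can be generated as in the sketch below. The function name, the vocabulary size, and the grid of $\lambda$ values are illustrative assumptions; only the zero mean and the isotropic covariance $\lambda I$ come from the text.

```python
import numpy as np


def random_word_vectors(n_words, k, lam, seed=0):
    """Sample n_words vectors from a Gaussian with covariance lam * I."""
    rng = np.random.default_rng(seed)
    # Covariance lam * I means each component has standard deviation
    # sqrt(lam) and the components are independent.
    return rng.normal(loc=0.0, scale=np.sqrt(lam), size=(n_words, k))


# Hypothetical grid of variances in (0, 1].
grid = (1e-5, 1e-4, 1e-3, 1e-2, 0.1, 0.2, 1.0)
baselines = {lam: random_word_vectors(10_000, 50, lam) for lam in grid}
```

Each set of vectors would then be used to initialize the word embeddings of the parser, and the resulting development-set accuracies compared across the grid.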
6.1 Transformation Function
We have tested four sets of transformation func-
tions on the co-occurrence matrix:
• $f_1 = \tanh(nx)$
• $f_2 = \sqrt[n]{x}$
• $f_3 = n(\sqrt[n]{x} - 1)$
• $f_4 = \frac{\log(2^{n+1}x + 1)}{\log(2^{n+1} + 1)}$
where $x \in [0,1]$ is an element of the matrix and $n$ is a natural
number that controls the degree of skewness: the higher the value of
$n$, the more the data will be skewed. Fig. 2b shows the effect of using
these transformation functions on parsing accuracy. The best results are
obtained from the $n$th-root and Box-Cox transformation functions with
$n = 7$, which are closely related to each other. Denoting the set of
word vectors obtained from the transformation function $f$ as
$\Upsilon(f)$, it can be shown that $\Upsilon(f_3) = \Upsilon(f_2) - n$,
since the effect of the coefficient $n$ in the first term of $f_3$ is
cancelled out by the right singular vectors in Eq. 1.

⁷ http://nlp.cs.lth.se/software/treebank-converter/
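
The four function families can be written down directly, which makes it easy to see how strongly each one magnifies small probabilities. The sketch below is illustrative only; the sample points and the choice $n = 7$ for every family are assumptions made for the example.

```python
import numpy as np


def f1(x, n):
    return np.tanh(n * x)


def f2(x, n):
    return x ** (1.0 / n)                  # n-th root


def f3(x, n):
    return n * (x ** (1.0 / n) - 1.0)      # Box-Cox form


def f4(x, n):
    c = 2.0 ** (n + 1)
    return np.log(c * x + 1.0) / np.log(c + 1.0)


x = np.array([0.0, 1e-6, 1e-3, 0.5, 1.0])
transformed = {f.__name__: f(x, 7) for f in (f1, f2, f3, f4)}
```

All four functions are monotonically increasing on $[0, 1]$, and $f_3$ is an affine rescaling of $f_2$, which is why the two give closely related word vectors.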
Fig. 2c visualizes the best transformation functions in each of the
function sets. All the functions share the property of having relatively
high derivatives around 0, which allows skewing the data in the
co-occurrence matrix to the right (i.e., close to one) and making a
clear gap between the syntactic structures that can happen (i.e.,
non-zero probabilities in the co-occurrence matrix) and those that
cannot happen (i.e., zero probabilities in the co-occurrence matrix).
This movement from zero to one, however, can lead to disproportional
expansions between the data close to zero and those that are close to
one. This is because the limit of the ratio of the derivative of
$f_i(x)$, $i = 1, 2, 3, 4$, to the derivative of $f_i(y)$ as $x$
approaches 0 and $y$ approaches 1 is infinity, i.e.,
$\lim_{x \to 0, y \to 1} \frac{f_i'(x)}{f_i'(y)} = \infty$. The almost
uniform behaviour of $f_1(x)$ for $x > 0.4$ results in a small variance
in the generated data, which will be ignored by the subsequent singular
value decomposition step; consequently, the information provided by the
most probable context words is lost. Our solution to these problems is
to use the following piecewise transformation function $f$:

$$f(x) = \begin{cases} \tanh(7x) & x \le \theta \\ \sqrt[7]{x} & x > \theta \end{cases} \qquad (2)$$

where $\theta = 10^{-n}$ and $n \in \mathbb{N}$. This function expands
the data in a more controlled manner (i.e.,
$\lim_{x \to 0, y \to 1} \frac{f'(x)}{f'(y)} = 49$) with less loss of
the information provided by the variance of the data. Using this
function with $\theta = 10^{-7}$, we could get UAS of 92.3 and LAS of
90.9 on the WSJ development set annotated with Stanford typed
dependencies, which is slightly better than other transformation
functions (see Fig. 2b). We obtain UAS of 92.0 and LAS of 90.6 on the
WSJ test set with the same setting.
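
The piecewise function of Eq. 2 and the derivative ratio of 49 can be checked with a short sketch. The helper names are hypothetical and the derivatives are written out analytically; the value $\theta = 10^{-7}$ is the one reported above.

```python
import numpy as np

THETA = 1e-7


def f(x):
    """Piecewise transformation: tanh(7x) up to THETA, 7th root above."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= THETA, np.tanh(7 * x), x ** (1.0 / 7))


def f_prime(x):
    """Analytic derivative of f on each branch."""
    x = np.asarray(x, dtype=float)
    tanh_branch = 7 * (1 - np.tanh(7 * x) ** 2)
    # The root branch is undefined at x = 0; that value is never selected.
    with np.errstate(divide="ignore"):
        root_branch = x ** (-6.0 / 7) / 7
    return np.where(x <= THETA, tanh_branch, root_branch)


# Slope at 0 is 7 and slope at 1 is 1/7, so the ratio is 49.
ratio = f_prime(0.0) / f_prime(1.0)
```

The finite ratio comes from replacing the root, whose slope is unbounded at zero, with the hyperbolic tangent on the small interval $[0, \theta]$.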
6.2 Dimensionality
The dimensionality of word vectors is determined
by the number of singular vectors used in Eq. 1.

Figure 2: Unlabelled attachment score of parsing models trained with (a)
the randomly generated word vectors, and (b) the systematically
extracted word vectors using different transformation functions. The
experiments are carried out with 50-dimensional word vectors. Parsing
models are evaluated on the development set in Wall Street Journal
annotated with Stanford typed dependencies (SD). $f$ in (b) is the
piecewise transformation function shown in Eq. 2. (c) The transformation
functions in each function set resulting in the best parsing models. The
horizontal axis shows the data in a probability co-occurrence matrix and
the vertical axis shows their projection after transformation. For
better visualization, the range of $f_3$ is scaled to [0,1].
Legend of panel (c): $f_1 = \tanh(7x)$; $f_2 = \sqrt[7]{x}$;
$\frac{1}{7}(f_3 + 7) = \sqrt[7]{x}$;
$f_4 = \log(2^9 x + 1) / \log(2^9 + 1)$.

High-dimensional word vectors are expected to result in higher parsing
accuracies. This is because they can capture more information from the
original data, i.e., the Frobenius norm of the difference between the
original matrix and its truncated estimation depends on the number of
top singular vectors used for constructing the truncated matrix. This
gain, however, comes at the cost of more computational resources a) to
extract the word vectors, and b) to process the word vectors in the
parser. The most expensive step in extracting the word vectors is the
singular value decomposition of the transformed co-occurrence matrix.
Using the randomized SVD method described by Tropp et al. (2009), the
extraction of the top $k$ singular vectors of an $m \times n$ matrix
requires $O(mn \log(k))$ floating point operations. This shows that the
cost of having larger word vectors grows logarithmically with the number
of dimensions.
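
The idea behind a randomized SVD can be sketched as below. This is a generic textbook-style variant with a Gaussian test matrix, chosen for brevity; it is not claimed to be the exact algorithm or configuration used in the experiments, and the sketching step of this simple variant costs $O(mnk)$ rather than the $O(mn \log(k))$ quoted above, which is associated with structured random test matrices. The function name and the oversampling parameter are assumptions.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, seed=0):
    """Approximate the top-k singular triplets of A by random projection."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    # Sample the range of A with a random Gaussian test matrix.
    Omega = rng.standard_normal((n, k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis of the sample
    # Project A onto that basis and decompose the much smaller matrix.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]


rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))  # rank 5
U, s, Vt = randomized_svd(A, k=5)
exact = np.linalg.svd(A, compute_uv=False)[:5]
```

For RSV only the rows of `Vt`, the right singular vectors, are needed to build the word vectors in Eq. 1.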
The parsing performance is affected by the dimensionality of the word
vectors fed into the input layer of the neural network in two ways:
First, a higher number of dimensions in the input layer leads to a
larger weight matrix between the input layer and the hidden layer.
Second, a larger hidden layer is needed to capture the dependencies
between the elements in the input layer. Given a set of word vectors
with $k$ dimensions connected to the hidden layer with $h$ hidden units,
the weight matrix between the input layer and the hidden layer grows
with the scale of $O(kh)$, and the weight matrix between the hidden
layer and the output layer grows with the scale of $O(h)$. For each
input vector, the back-propagation algorithm passes the weight matrices
three times per iteration: 1) to forward the input vector through the
network, 2) to back-propagate the errors generated by the inputs, and 3)
to update the network parameters. So, each input vector needs
$O(3(kh + h))$ time to be processed by the algorithm. Given the trained
model, the output signals are generated through only one forward pass.
Table 2 shows how high-dimensional word vectors affect the parsing
performance. In general, increasing the number of hidden units leads to
a more accurate parsing model at a linear cost in parsing speed.
Increasing the dimensionality of word vectors up to 200 dimensions
consistently increases the parsing accuracy, again at a linear cost in
parsing speed. However, increasing both the dimensionality of word
vectors and the size of the hidden layer leads to a quadratic decrease
in parsing speed. The best results are obtained from the parsing model
trained with 100-dimensional word vectors and 400 hidden units,
resulting in a parsing accuracy of 93.0 UAS and 91.7 LAS on our test
set, a +1.0 UAS and +1.1 LAS improvement over what we obtained with
50-dimensional word vectors. It is obtained at the cost of a 47%
reduction in the parsing speed.
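
As a quick check of these figures, assume that the baseline is the configuration with 50-dimensional word vectors and 200 hidden units in Table 2 (392 sentences/second, against 206 sentences/second for 100 dimensions and 400 hidden units), and that the baseline test-set scores are the 92.0 UAS and 90.6 LAS reported in Sec. 6.1. Under that reading the numbers are consistent:

$$1 - \frac{206}{392} \approx 0.47, \qquad 93.0 - 92.0 = 1.0\ \text{UAS}, \qquad 91.7 - 90.6 = 1.1\ \text{LAS}.$$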
6.3 Comparison and Consistency
We evaluate the RSV word vectors on different types of dependency
representations and different languages. Table 3 gives a comparison
between RSV and different methods of word embedding with respect to
their contributions to dependency parsing and the time required to
generate word vectors for English. All the parsing models are trained
with 400 hidden units and 100-dimensional word vectors extracted from
the English raw corpus described in Sec. 5. The word vectors are
extracted on a Linux machine running on 12 CPU cores. The free
parameters of the word embedding methods, i.e., context type and context
size, have been tuned on the development set, and the best settings,
resulting in the highest parsing accuracies, were then chosen for
comparison. This leads us to an asymmetric window of length 1 for RSV,
GloVe, and HPCA, and a symmetric window of length 1 for the word2vec
models, CBOW and SkipGram. The GloVe word vectors are extracted by the
available implementation of GloVe (Pennington et al., 2014) running for
50 iterations. The HPCA word vectors are extracted by our implementation
of the method (Lebret and Collobert, 2014). CBOW and SkipGram word
vectors are extracted by the available implementation of word2vec
(Mikolov et al., 2013) running for 10 iterations with a negative
sampling value of 5. In order to show the validity of the results, we
perform a bootstrap statistical significance test (Berg-Kirkpatrick et
al., 2012) on the results obtained from each parsing experiment and RSV
with the null hypothesis $H_0$: RSV is no better than the model $B$,
where $B$ can be any of the word embedding methods. The resulting
p-values are reported together with the parsing accuracies.

             h = 200              h = 300              h = 400
  k      UAS   LAS    P       UAS   LAS    P       UAS   LAS    P
  50    92.3  90.9  392      92.9  91.5  307      93.0  91.6  237
 100    92.6  91.2  365      92.9  91.5  263      93.1  91.8  206
 150    92.6  91.2  321      92.9  91.5  236      93.1  91.8  186
 200    92.7  91.3  310      93.1  91.7  212      93.1  91.8  165
 250    92.7  91.2  286      92.9  91.5  201      93.0  91.7  146
 300    92.6  91.2  265      92.9  91.6  180      92.9  91.5  119
 350    92.7  91.2  238      92.8  91.4  174      92.9  91.5  111
 400    92.6  91.2  235      92.8  91.3  141      93.0  91.5   97

Table 2: The performance of parsing models trained with k-dimensional
word vectors and h hidden units. The parsing accuracies (UAS, LAS) are
for the development set. P: parsing speed (sentences/second).
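
A paired bootstrap test of the kind referred to above can be sketched as follows. This follows the general recipe of resampling test sentences with replacement and counting how often the resampled gain exceeds twice the observed gain; whether the experiments resampled sentences or tokens, and how many samples were drawn, is not stated, so those choices, the function name, and the toy data are assumptions.

```python
import numpy as np


def paired_bootstrap_p(correct_a, correct_b, totals, n_samples=2000, seed=0):
    """p-value for H0: system A is no better than system B.

    correct_a, correct_b : per-sentence counts of correctly attached tokens
    totals               : per-sentence token counts
    """
    correct_a = np.asarray(correct_a)
    correct_b = np.asarray(correct_b)
    totals = np.asarray(totals)
    rng = np.random.default_rng(seed)
    n = len(totals)
    # Observed gain of A over B in attachment score on the full test set.
    observed = (correct_a.sum() - correct_b.sum()) / totals.sum()
    # Resample sentences with replacement and recompute the gain.
    idx = rng.integers(0, n, size=(n_samples, n))
    gains = (correct_a[idx].sum(axis=1)
             - correct_b[idx].sum(axis=1)) / totals[idx].sum(axis=1)
    # Fraction of samples whose gain exceeds twice the observed gain.
    return float(np.mean(gains > 2 * observed))


# Toy data: 200 sentences of 10 tokens; A is clearly more accurate than B.
rng = np.random.default_rng(1)
totals = np.full(200, 10)
correct_a = rng.binomial(10, 0.93, size=200)
correct_b = rng.binomial(10, 0.80, size=200)
p_value = paired_bootstrap_p(correct_a, correct_b, totals)
```

A small p-value indicates that a gain as large as the observed one would rarely arise if A were in fact no better than B.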
The empirical results show that HPCA, RSV, and GloVe are the fastest
methods of word embedding, in that order. These methods are faster than
word2vec because they scan the corpus only once and then store it as a
co-occurrence matrix in memory. HPCA is faster than RSV because HPCA
stores the co-occurrence matrix as a sparse matrix, whereas RSV stores
it as a full matrix. In return for this expense, RSV produces
higher-quality word vectors than HPCA for the task of dependency
parsing.
The results obtained from RSV word vectors are comparable to, and
slightly better than, those of other sets of word vectors. The
difference between RSV and other methods is clearer when one looks at
the difference between the labelled attachment scores. Apart from the
parsing experiment with GloVe on the universal dependencies, the
relatively small p-values reject our null hypothesis and confirm that
RSV can result in higher-quality word vectors for the task of dependency
parsing. In addition, the consistent superiority of the results obtained
from RSV on different dependency styles is evidence that the results are
statistically significant, i.e., the victory of RSV is not merely due to
chance. Among the methods of word embedding, we see that the results
obtained from GloVe are the closest to RSV, especially on universal
dependencies. We show in Sec. 8 how these methods are connected to each
other.

Model    Time (s)    SD             CD             UD
                     UAS    LAS     UAS    LAS     UAS    LAS
CBOW       8741      93.0   91.5    93.4   92.6    88.0   85.4
  p-val              0.00   0.00    0.02   0.00    0.00   0.00
SGram     11113      93.0   91.6    93.4   92.5    87.4   84.9
  p-val              0.06   0.02    0.00   0.00    0.00   0.00
GloVe      3150      92.9   91.6    93.5   92.6    88.4   85.8
  p-val              0.02   0.02    0.04   0.06    0.54   0.38
HPCA       2749      92.1   90.8    92.5   91.7    86.6   84.0
  p-val              0.00   0.00    0.00   0.00    0.00   0.00
RSV        2859      93.1   91.8    93.6   92.8    88.4   85.9

Table 3: Performance of word embedding methods. The quality of word
vectors is measured with respect to parsing models trained with them.
The efficiency of models is measured with respect to the time (seconds)
required to extract a set of word vectors. Parsing models are evaluated
on our English development set. SGram: SkipGram, SD: Stanford typed
Dependencies, CD: CoNLL Dependencies, UD: Universal Dependencies, and
p-val: p-value of the null hypothesis that RSV is no better than the
word embedding method corresponding to each cell of the table.
Table 4 shows the results obtained from the Stanford parser trained with
RSV vectors and two other greedy transition-based dependency parsers,
MaltParser (Nivre et al., 2006) and Parsito (Straka et al., 2015). All
the parsing models are trained with the arc-standard system on the
corpus of universal dependencies. Par-St and Par-Sr refer to the results
reported for Parsito trained with a static oracle and a search-based
oracle, respectively. As shown, in all cases, the parsing models trained
with the Stanford parser and RSV (St-RSV) are more accurate than the
other parsing models. The superiority of the results obtained from
St-RSV over Par-Sr shows the importance of word vectors in dependency
parsing in comparison with adding more features to the parser or using a
search-based oracle.

                Par-St   Par-Sr   Malt   St-RSV
English   UAS    86.7     87.4    86.3    87.6
          LAS    84.2     84.7    82.9    84.9
Persian   UAS    83.8     84.5    80.8    85.4
          LAS    80.2     81.1    77.2    82.4
Swedish   UAS    85.3     85.9    84.7    86.2
          LAS    81.4     82.3    80.3    82.5

Table 4: Accuracy of dependency parsing. Par-St and Par-Sr refer to the
Parsito models trained with static oracle and search-based oracle.
St-RSV refers to the Stanford parser trained with RSV vectors.

The results obtained from the Stanford parser and UDPipe (Straka et al.,
2016) are summarized in Table 5. The results are reported for both
predicted and gold POS tags. UDPipe sacrifices the greedy nature of
Parsito by adding a beam search decoder to it. In general, one can argue
that UDPipe adds the following items to the Stanford parser: 1) a
search-based oracle, 2) a set of morphological features, and 3) a beam
search decoder. The almost identical results obtained from both parsers
for English show that a set of informative word vectors can be as
influential as the three extra items added by UDPipe. However, the
higher accuracies obtained from UDPipe for Swedish and Persian, and the
fact that the training data for these languages are considerably smaller
than for English, show the importance of these extra items for the
accuracy of the parsing model when not enough training data is
available.
                  Predicted tags    Gold tags
Language  Metric  UDPipe  St-RSV    UDPipe  St-RSV
English   UAS     84.0    84.6      87.5    87.6
          LAS     80.2    80.9      85.0    84.9
Swedish   UAS     81.2    80.4      86.2    86.2
          LAS     77.0    76.6      83.2    82.5
Persian   UAS     84.1    82.4      86.3    85.4
          LAS     79.7    78.1      83.0    82.4
Table 5: Accuracy of dependency parsing on the corpus of
universal dependencies. St-RSV refers to the Stanford parser
trained with RSV vectors.
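The comparison between UDPipe and St-RSV can be made concrete by computing the score gaps directly. The following is a small sketch using the (UAS, LAS) pairs as read from Table 5; the dictionary layout and the helper name `gaps` are illustrative choices, not part of either parser.

```python
# (UAS, LAS) pairs as read from Table 5.
table5 = {
    "English": {"predicted": {"UDPipe": (84.0, 80.2), "St-RSV": (84.6, 80.9)},
                "gold":      {"UDPipe": (87.5, 85.0), "St-RSV": (87.6, 84.9)}},
    "Swedish": {"predicted": {"UDPipe": (81.2, 77.0), "St-RSV": (80.4, 76.6)},
                "gold":      {"UDPipe": (86.2, 83.2), "St-RSV": (86.2, 82.5)}},
    "Persian": {"predicted": {"UDPipe": (84.1, 79.7), "St-RSV": (82.4, 78.1)},
                "gold":      {"UDPipe": (86.3, 83.0), "St-RSV": (85.4, 82.4)}},
}

def gaps(table):
    """St-RSV minus UDPipe, rounded to one decimal, per language and tag setting."""
    out = {}
    for lang, settings in table.items():
        out[lang] = {}
        for setting, scores in settings.items():
            uas = round(scores["St-RSV"][0] - scores["UDPipe"][0], 1)
            las = round(scores["St-RSV"][1] - scores["UDPipe"][1], 1)
            out[lang][setting] = (uas, las)
    return out

diff = gaps(table5)
for lang, settings in diff.items():
    print(lang, settings)
```

With predicted tags, the UAS gap is +0.6 for English but -0.8 for Swedish and -1.7 for Persian, which is the pattern the discussion above attributes to the smaller training sets.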
7 Nature of Dimensions
In this section, we study the nature of dimensions
formed by RSV. Starting from the high-dimensional
space formed by the transformed co-occurrence
matrix $f(C)$, word similarities can be measured
by a similarity matrix $K = f(C^T) f(C)$ whose
leading eigenvectors, corresponding to the leading
right singular vectors of $f(C)$, form the RSV word
vectors. This suggests that RSV dimensions measure
a typical kind of word similarity on the basis of
the variability of words' contexts, since the eigenvectors
of $K$ account for the directions of largest
variance in the word vectors defined by $f(C)$.
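The correspondence between the eigenvectors of K and the right singular vectors of f(C) can be checked numerically. The following is a minimal sketch, assuming NumPy, a small random matrix standing in for the co-occurrence probabilities, and the 7th-root transformation as f; the function name `leading_directions` is illustrative and not part of any released RSV code.

```python
import numpy as np

def leading_directions(F, k):
    """Return the top-k singular values and right singular vectors of F,
    together with the top-k eigenvalues and eigenvectors of K = F^T F."""
    _, s, Vt = np.linalg.svd(F, full_matrices=False)
    K = F.T @ F                               # similarity matrix
    w, E = np.linalg.eigh(K)                  # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]           # indices of the k largest
    return s[:k], Vt[:k], w[order], E[:, order].T

rng = np.random.default_rng(0)
C = rng.random((30, 12))                      # hypothetical stand-in data
C = C / C.sum(axis=1, keepdims=True)          # rows behave like probabilities
F = C ** (1.0 / 7.0)                          # 7th-root transformation

s, Vt, eigvals, Et = leading_directions(F, k=3)

# Eigenvalues of K are the squared singular values of F.
print(np.allclose(eigvals, s ** 2))
# Matching vectors agree up to sign, so the absolute cosine is close to 1.
print(np.abs(np.sum(Vt * Et, axis=1)))
```

Since each eigenvalue of K is a squared singular value of f(C), ordering the dimensions by singular value orders them by the variance they capture.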
To assess the validity of this statement, we study
the dimensions individually. For each dimension,
we first project all unit-sized word vectors onto it
and then sort the resulting values in ascending order
to see whether any syntactic or semantic regularities
emerge. Table 6 shows the 10 words appearing at the
head of the ordered lists for the first 10 dimensions.
The dimensions are indexed according to
their related singular vectors. The table shows that,
to some extent, the dimensions match syntactic and
semantic word categories discussed in linguistics.
There is a direct relation between the indices and
the variability of words' contexts: the regularities
between the words appearing in highly variable
contexts, mostly the high-frequency words,
are captured by the leading dimensions.
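The inspection procedure described above (normalise, project, sort ascending, read off the head of the list) can be sketched as follows. This is a minimal illustration assuming NumPy and a word-vector matrix with one row per word; the function name `head_words` and the toy vocabulary and values are hypothetical and are not taken from the experiments.

```python
import numpy as np

def head_words(vectors, vocab, dim, top=10):
    """Words with the lowest projection on dimension `dim`.

    vectors: (n_words, k) array, one row per word.
    vocab:   list of n_words strings aligned with the rows.
    """
    # Scale every word vector to unit length.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Projecting onto a coordinate axis is reading off that coordinate.
    scores = unit[:, dim]
    # Ascending order, keeping the first `top` entries.
    order = np.argsort(scores, kind="stable")[:top]
    return [vocab[i] for i in order]

# Toy data, for illustration only.
vocab = ["the", "runs", "smith", "red"]
vectors = np.array([[-2.0, 0.1],
                    [0.5, -2.0],
                    [1.0, 2.0],
                    [0.2, 0.3]])

print(head_words(vectors, vocab, dim=0, top=2))  # ['the', 'runs']
print(head_words(vectors, vocab, dim=1, top=2))  # ['runs', 'the']
```

Calling the helper once per dimension with `top=10` would produce lists in the format of Table 6.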
To a large extent, the first dimension accounts
for the degree of variability of words' contexts.
Lower numbers are given to words that appear
in highly flexible contexts (i.e., high-frequency
words such as as, but, in and ...). Dimensions 2–5
are informative for determining syntactic categories
such as adjectives, proper nouns, function
words, and verbs. Dimensions 3 and 8 give lower
numbers to proper names (interestingly, mostly
last names in 3 and male first names in 8). Some
kind of semantic regularity can also be seen in
most dimensions. For example, adjectives in dimension 2
are mostly connected with society,
nouns in dimension 6 denote humans, nouns in dimension 7
are abstract, and words in dimension 9
are mostly connected to emotions.
8 Related Work on Word Vectors
The two dominant approaches to creating word
vectors (or word embeddings) are: 1) incremental
methods that update a set of randomly initialized
word vectors while scanning a corpus (Mikolov
et al., 2013; Collobert et al., 2011), and 2) batch
methods that extract a set of word vectors from a
co-occurrence matrix (Pennington et al., 2014; Lebret
and Collobert, 2014). Pennington et al. (2014)
show that both approaches are closely related to
each other. Here, we elaborate on the connections
between RSV and HPCA (Lebret and Collobert,
2014) and GloVe (Pennington et al., 2014).

Dim  Top 10 words
1    . – , is as but in ... so
2    domestic religious civilian russian physical social iraqi japanese mexican scientific
3    mitchell reid evans allen lawrence palmer duncan russell bennett owen
4    . but in – the and or as at ,
5    ’s believes thinks asks wants replied tries says v. agrees
6    citizens politicians officials deputy businessmen lawmakers former elected lawyers politician
7    cooperation policy reforms policies funding reform approval compliance oversight assistance
8    geoff ron doug erik brendan kurt jeremy brad ronnie yuri
9    love feeling sense answer manner desire romantic emotional but ...
10   have were are but – . and may will
Table 6: Top 10 words projected on the top 10 dimensions
HPCA performs a Hellinger transformation followed
by principal component analysis on the
co-occurrence matrix $C$ as below:

$$Y = S V^T \qquad (3)$$

where $Y$ is the matrix of word vectors, and $S$ and
$V$ are the matrices of top singular values and right
singular vectors of $\sqrt{C}$. Since the word vectors are to
be used by a neural network, Lebret and Collobert
(2014) recommend normalizing them to avoid the
saturation problem in the network weights (LeCun
et al., 2012). Denoting by $\tilde{Y}$ the empirical
mean of the column vectors in $Y$ and by $\sigma(Y)$ their
standard deviation, Eq. 4, suggested by Lebret and
Collobert (2014), normalizes the elements of the word
vectors to have zero mean and a fixed standard
deviation of $\lambda \le 1$.

$$\Upsilon = \frac{\lambda (Y - \tilde{Y})}{\sigma(Y)} \qquad (4)$$

$\tilde{Y}$ is $0$ if one centres the column vectors in $\sqrt{C}$
around their mean before performing PCA. Substituting
Eq. 3 into Eq. 4 and using the facts that $\tilde{Y} = 0$
and $\sigma(Y) = \frac{1}{\sqrt{n-1}} S$, where $n$ is the number of
words, we reach Eq. 1.
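The two facts used in this substitution can be verified numerically. The sketch below assumes NumPy, a random matrix standing in for the co-occurrence probabilities, and an arbitrary value $\lambda = 0.1$; the function name `hpca_vectors` is illustrative. It computes Eq. 3 and Eq. 4 literally and compares the result with the closed form $\lambda \sqrt{n-1}\, V^T$ that the substitution yields.

```python
import numpy as np

def hpca_vectors(C, k, lam=0.1):
    """Eq. 3 followed by Eq. 4, with the data centred before the SVD."""
    X = np.sqrt(C)                                # square-root transformation
    X = X - X.mean(axis=1, keepdims=True)         # centre the column vectors
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    s, Vt = s[:k], Vt[:k]                         # top-k singular values/vectors
    Y = s[:, None] * Vt                           # Eq. 3: Y = S V^T, shape (k, n)
    Y_mean = Y.mean(axis=1, keepdims=True)        # empirical mean of the columns
    Y_std = Y.std(axis=1, ddof=1, keepdims=True)  # per-dimension std deviation
    Upsilon = lam * (Y - Y_mean) / Y_std          # Eq. 4
    return Y, Upsilon, s, Vt

rng = np.random.default_rng(0)
n = 40                                            # hypothetical vocabulary size
C = rng.random((n, n))
C = C / C.sum(axis=1, keepdims=True)
lam = 0.1                                         # assumed value, any lam <= 1 works

Y, Upsilon, s, Vt = hpca_vectors(C, k=5, lam=lam)
closed_form = lam * np.sqrt(n - 1) * Vt

print(np.allclose(Y.mean(axis=1), 0.0, atol=1e-10))            # mean is zero
print(np.allclose(Y.std(axis=1, ddof=1), s / np.sqrt(n - 1)))  # sigma(Y) = S / sqrt(n-1)
print(np.allclose(Upsilon, closed_form))                        # Eq. 4 collapses
```

The singular values cancel in the normalisation, so after Eq. 4 only the directions given by the right singular vectors remain.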
In general, one can argue that RSV generalises
the idea of the Hellinger transformation used in HPCA
through a set of more general transformation functions.
Other differences between RSV and HPCA
are in 1) how they form the co-occurrence matrix
$C$, and 2) when they centre the data. For each
word $w_i$ and each context word $w_j$, $C_{i,j}$ is
$p(w_j|w_i)$ in RSV, but $p(w_i|w_j)$ in HPCA. In RSV, the
column vectors of $f(C)$ are centred around their
means before performing SVD, but in HPCA, the
data are centred after performing PCA. In Sec. 6,
we showed that these changes result in a significant
improvement in the quality of word vectors.
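The two ways of forming $C$ differ only in the direction of normalisation. The following is a minimal sketch, assuming NumPy and a count matrix `N` in which `N[i, j]` is the number of times context word $w_j$ occurs with word $w_i$; the function names and the toy counts are hypothetical, and the 7th root is used as the transformation $f$.

```python
import numpy as np

def rsv_cooccurrence(N):
    """C[i, j] = p(w_j | w_i): each row of counts is normalised to sum to 1."""
    return N / N.sum(axis=1, keepdims=True)

def hpca_cooccurrence(N):
    """C[i, j] = p(w_i | w_j): each column of counts is normalised to sum to 1."""
    return N / N.sum(axis=0, keepdims=True)

def centre_columns(F):
    """Subtract the mean column vector from every column of F."""
    return F - F.mean(axis=1, keepdims=True)

# Toy counts, for illustration only.
N = np.array([[4.0, 1.0, 0.0],
              [2.0, 2.0, 1.0],
              [0.0, 3.0, 3.0]])

C_rsv = rsv_cooccurrence(N)
C_hpca = hpca_cooccurrence(N)

# RSV ordering: transform, centre, and only then decompose.
F_centred = centre_columns(C_rsv ** (1.0 / 7.0))
_, s, Vt = np.linalg.svd(F_centred, full_matrices=False)

print(C_rsv.sum(axis=1))    # rows sum to 1
print(C_hpca.sum(axis=0))   # columns sum to 1
```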
The connection between RSV and GloVe is as
follows. GloVe extracts word vectors from a
co-occurrence matrix transformed by the logarithm
function. Using a global regression model, Pennington
et al. (2014) argue that linear directions of meaning
are captured by the matrix of word vectors $\Upsilon_{n,k}$
with the following property:

$$\Upsilon^T \Upsilon = \log(C) + b\mathbf{1} \qquad (5)$$

where $C_{n,n}$ is the co-occurrence matrix, $b_{n,1}$ is a
bias vector, and $\mathbf{1}_{1,n}$ is a vector of ones. Denoting
by $\Upsilon_i$ the $i$th column of $\Upsilon$ and assuming $\|\Upsilon_i\| = 1$ for
$i = 1 \dots n$, the left-hand side of Eq. 5 measures
the cosine similarity between unit-sized word vectors
$\Upsilon_i$ in a kernel space and the right-hand side
is the corresponding kernel matrix. Using kernel
principal component analysis (Schölkopf et al.,
1998), a $k$-dimensional estimation of $\Upsilon$ in Eq. 5 is

$$\Upsilon = \sqrt{S} V^T \qquad (6)$$

where $S$ and $V$ are the matrices of top singular
values and singular vectors of $K$. Replacing the
kernel matrix in Eq. 5 with the second-degree polynomial
kernel $K = f(C^T) f(C)$, which measures
the similarities on the basis of the column vectors
defined by the co-occurrence matrix, the word vectors
generated by Eq. 6 and Eq. 1 are distributed in
the same directions but with different variances. This
shows that the main difference between RSV and
GloVe is in the kernel matrices they use.
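Eq. 6 and the "same directions, different variances" observation can be illustrated numerically. The sketch below assumes NumPy and a small random matrix `F` standing in for $f(C)$; the function name `kernel_pca_vectors` is illustrative. It builds $\Upsilon = \sqrt{S} V^T$ from the kernel $K$ and compares its rows with the right singular vectors of `F`.

```python
import numpy as np

def kernel_pca_vectors(K, k):
    """k-dimensional estimate sqrt(S) V^T from a symmetric PSD kernel K (Eq. 6)."""
    w, E = np.linalg.eigh(K)                  # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]           # indices of the k largest
    S = np.clip(w[order], 0.0, None)          # guard against negative round-off
    V = E[:, order]                           # (n, k) leading eigenvectors
    return np.sqrt(S)[:, None] * V.T          # shape (k, n)

rng = np.random.default_rng(0)
F = rng.random((6, 8))                        # hypothetical stand-in for f(C)
K = F.T @ F                                   # second-degree polynomial kernel
Upsilon = kernel_pca_vectors(K, k=6)          # K has rank 6 here

# Right singular vectors of F give the directions used by RSV.
_, s, Vt = np.linalg.svd(F, full_matrices=False)
row_norms = np.linalg.norm(Upsilon, axis=1)
cosines = np.abs(np.sum(Upsilon * Vt, axis=1)) / row_norms

print(np.allclose(Upsilon.T @ Upsilon, K))    # full-rank estimate reproduces K
print(cosines)                                # close to 1: same directions
print(row_norms)                              # differ per dimension: different variances
```

In this sketch each row of $\Upsilon$ is a right singular vector of `F` scaled by its singular value, whereas the normalisation in Eq. 4 gives every dimension the same standard deviation.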
9 Conclusion
In this paper, we have proposed to form a set of
word vectors from the right singular vectors of a
co-occurrence matrix that is transformed by a 7th-
root transformation function. It has been shown
that the proposed method is closely related to pre-
vious methods of word embedding such as HPCA
and GloVe. Our experiments on the task of de-
pendency parsing show that the parsing models
trained with our word vectors are more accurate
than the parsing models trained with other popular
methods of word embedding.
References
Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud
Rahgozar, and Farhad Oroumchian. 2009.
Hamshahri: A standard Persian text collection.
Knowledge-Based Systems, 22(5):382–387.
Taylor Berg-Kirkpatrick, David Burkett, and Dan
Klein. 2012. An empirical investigation of statistical
significance in NLP. In Proceedings of the 2012
Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural
Language Learning, EMNLP-CoNLL ’12, pages
995–1005, Stroudsburg, PA, USA. Association for
Computational Linguistics.
Danqi Chen and Christopher Manning. 2014. A fast
and accurate dependency parser using neural net-
works. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing
(EMNLP), pages 740–750.
Ronan Collobert, Jason Weston, Léon Bottou, Michael
Karlen, Koray Kavukcuoglu, and Pavel Kuksa.
2011. Natural language processing (almost) from
scratch. The Journal of Machine Learning Research,
12:2493–2537.
Marie-Catherine De Marneffe and Christopher D Man-
ning. 2010. Stanford typed dependencies manual
(2008). URL: http://nlp.stanford.edu/
software/dependencies_manual.pdf.
Marie-Catherine De Marneffe, Bill MacCartney,
Christopher D Manning, et al. 2006. Generat-
ing typed dependency parses from phrase structure
parses. In Proceedings of LREC, volume 6, pages
449–454.
Richard Johansson and Pierre Nugues. 2007. Extended
constituent-to-dependency conversion for English.
In 16th Nordic Conference of Computational
Linguistics, pages 105–112. University of Tartu.
Rémi Lebret and Ronan Collobert. 2014. Word embeddings
through Hellinger PCA. In Proceedings of
the 14th Conference of the European Chapter of the
Association for Computational Linguistics, pages
482–490, Gothenburg, Sweden, April. Association
for Computational Linguistics.
Rémi Lebret and Ronan Collobert. 2015. Rehabilitation
of count-based models for word vector representations.
In Computational Linguistics and Intelligent
Text Processing, pages 417–429. Springer.
Yann A LeCun, Léon Bottou, Genevieve B Orr, and
Klaus-Robert Müller. 2012. Efficient backprop. In
Neural networks: Tricks of the trade, pages 9–48.
Springer.
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann
Marcinkiewicz. 1993. Building a large annotated
corpus of English: The Penn treebank. Computa-
tional Linguistics - Special issue on using large cor-
pora, 19(2):313 – 330, June.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013. Efficient estimation of word represen-
tations in vector space. In Proceedings of Workshop
at ICLR.
Joakim Nivre, Johan Hall, and Jens Nilsson. 2006.
MaltParser: A data-driven parser-generator for dependency
parsing. In Proceedings of the 5th International
Conference on Language Resources and
Evaluation (LREC), pages 2216–2219.
Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin-
ter, Yoav Goldberg, Jan Hajic, Christopher D Man-
ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo,
Natalia Silveira, et al. 2016. Universal dependen-
cies v1: A multilingual treebank collection. In Pro-
ceedings of the 10th International Conference on
Language Resources and Evaluation (LREC 2016).
Joakim Nivre. 2004. Incrementality in deterministic
dependency parsing. In Proceedings of the Work-
shop on Incremental Parsing: Bringing Engineering
and Cognition Together, pages 50–57. Association
for Computational Linguistics.
Jeffrey Pennington, Richard Socher, and Christopher D
Manning. 2014. GloVe: Global vectors for word
representation. In EMNLP, volume 14, pages 1532–1543.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert
Müller. 1998. Nonlinear component analysis
as a kernel eigenvalue problem. Neural computation,
10(5):1299–1319.
Mojgan Seraji. 2015. Morphosyntactic Corpora and
Tools for Persian. Ph.D. thesis, Uppsala University.
Milan Straka, Jan Hajic, Jana Straková, and Jan Hajic
jr. 2015. Parsing universal dependency treebanks
using neural networks and search-based oracle. In
International Workshop on Treebanks and Linguistic
Theories (TLT14), pages 208–220.
Milan Straka, Jan Hajic, and Jana Straková. 2016.
UDPipe: Trainable pipeline for processing CoNLL-U
files performing tokenization, morphological analysis,
POS tagging and parsing. In Nicoletta Calzolari
(Conference Chair), Khalid Choukri, Thierry
Declerck, Sara Goggi, Marko Grobelnik, Bente
Maegaard, Joseph Mariani, Helene Mazo, Asuncion
Moreno, Jan Odijk, and Stelios Piperidis, editors,
Proceedings of the Tenth International Conference
on Language Resources and Evaluation (LREC
2016), Paris, France, May. European Language Resources
Association (ELRA).
A Tropp, N Halko, and PG Martinsson. 2009. Finding
structure with randomness: Stochastic algorithms
for constructing approximate matrix decompositions.
Technical report, Applied & Computational
Mathematics, California Institute of Technology.