
Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing

May 2017


Conference: Proceedings of the 21st Nordic Conference on Computational Linguistics, NoDaLiDa
At: Gothenburg, Sweden

Authors:

Ali Basirat

University of Copenhagen

Joakim Nivre

Uppsala University

We show that a set of real-valued word vectors formed by right singular vectors of a transformed co-occurrence matrix are meaningful for determining different types of dependency relations between words. Our experimental results on the task of dependency parsing confirm the superiority of the word vectors to the other sets of word vectors generated by popular methods of word embedding. We also study the effect of using these vectors on the accuracy of dependency parsing in different languages versus using more complex parsing architectures.



Ali Basirat and Joakim Nivre

Department of Linguistics and Philology

Uppsala University

{ali.basirat,joakim.nivre}@lingfil.uu.se


1 Introduction

Greedy transition-based dependency parsing is appealing thanks to its efficiency, deriving a parse tree for a sentence in linear time using a discriminative classifier. Among the different classification methods used in greedy dependency parsers, neural network models capable of using real-valued vector representations of words, called word vectors, have shown significant improvements in both the accuracy and the speed of parsing. Chen and Manning (2014) were the first to propose using word vectors in a three-layer feed-forward neural network as the core classifier in a transition-based dependency parser. The classifier is trained by the standard back-propagation algorithm. Using a limited number of features defined over a certain number of elements in a parser configuration, they built an efficient and accurate parser, called the Stanford neural dependency parser. This architecture was then extended by Straka et al. (2015) and Straka et al. (2016). Parsito (Straka et al., 2015) adds a search-based oracle and a set of morphological features to the original architecture in order to make it capable of parsing the corpus of universal dependencies. UDPipe (Straka et al., 2016) adds beam search decoding to Parsito in order to improve the parsing accuracy at the cost of decreasing the parsing speed.

We propose to improve the parsing accuracy of the architecture introduced by Chen and Manning (2014) by using more informative word vectors. The idea is based on the greedy nature of the back-propagation algorithm, which makes it highly sensitive to its initial state. Thus, better word vectors are expected to improve the parsing accuracy. The word vectors in our approach are formed by the right singular vectors of a matrix returned by a transformation function that takes a probability co-occurrence matrix as input and expands the data massed around zero. We show how the proposed method is related to HPCA (Lebret and Collobert, 2014) and GloVe (Pennington et al., 2014).

Using these word vectors with the Stanford parser, we obtain a parsing accuracy of 93.0% UAS and 91.7% LAS on the Wall Street Journal (Marcus et al., 1993). The word vectors consistently improve the parsing models trained with different types of dependencies in different languages. Our experimental results show that parsing models trained with the Stanford parser can be as accurate as, or in some cases more accurate than, other parsers such as Parsito and UDPipe.

2 Transition-Based Dependency Parsing

A greedy transition-based dependency parser derives a parse tree from a sentence by predicting a sequence of transitions between a set of configurations characterized by a triple c = (Σ, B, A), where Σ is a stack that stores partially processed nodes, B is a buffer that stores unprocessed nodes in the input sentence, and A is a set of partial parse trees assigned to the processed nodes. Nodes are positive integers corresponding to linear positions of words in the input sentence. The process of parsing starts from an initial configuration and ends with some terminal configuration. The transitions between configurations are controlled by a classifier trained on a history-based feature model which combines features of the partially built dependency tree and attributes of input tokens.

The arc-standard algorithm (Nivre, 2004) is one of the many algorithms proposed for moving between configurations. The algorithm starts with the initial configuration in which all words are in B, Σ is empty, and A holds an artificial node 0. It uses three actions, Shift, Right-Arc, and Left-Arc, to transition between the configurations and build the parse tree. Shift unconditionally pushes the head node of the buffer onto the stack. The two actions Left-Arc and Right-Arc are used to build left and right dependencies, respectively, and are restricted by the fact that the final dependency tree has to be rooted at node 0.
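The three actions can be sketched in a few lines of Python. This is a minimal illustration rather than the parser's actual implementation: the class and function names are hypothetical, arcs are stored as (head, dependent) pairs, and the sketch assumes the common convention that the artificial root node 0 starts on the stack, which differs from the placement described above.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    stack: list = field(default_factory=list)   # Σ: partially processed nodes
    buffer: list = field(default_factory=list)  # B: unprocessed nodes
    arcs: set = field(default_factory=set)      # A: (head, dependent) pairs


def initial(n_words):
    # Assumption of this sketch: the artificial root node 0 starts on the stack.
    return Configuration(stack=[0], buffer=list(range(1, n_words + 1)))


def shift(c):
    # Move the first node of the buffer onto the stack.
    c.stack.append(c.buffer.pop(0))


def left_arc(c):
    # The top of the stack becomes the head of the node below it.
    # The root node 0 may never become a dependent.
    if len(c.stack) < 2 or c.stack[-2] == 0:
        raise ValueError("Left-Arc is not permitted here")
    dependent = c.stack.pop(-2)
    c.arcs.add((c.stack[-1], dependent))


def right_arc(c):
    # The node below the top becomes the head of the top of the stack.
    if len(c.stack) < 2:
        raise ValueError("Right-Arc is not permitted here")
    dependent = c.stack.pop()
    c.arcs.add((c.stack[-1], dependent))


def is_terminal(c):
    # Terminal configuration: empty buffer and only the root left on the stack.
    return not c.buffer and c.stack == [0]


# Hypothetical three-word sentence "She eats fish" (nodes 1, 2, 3).
c = initial(3)
for action in (shift, shift, left_arc, shift, right_arc, right_arc):
    action(c)
```

In a real parser the sequence of actions is not given but predicted one step at a time by the classifier from the current configuration.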

3 Stanford Dependency Parser

The Stanford dependency parser can be considered a turning point in the history of greedy transition-based dependency parsing. The parser significantly improved both the accuracy and the speed of dependency parsing. Its success can be summarized in two points: 1) accuracy is improved by using a neural network with pre-trained word vectors, and 2) efficiency is improved by pre-computation that keeps the most frequent computations in memory.

The parser is an arc-standard system with a feed-forward neural network as its classifier. The neural network consists of three layers. An input layer connects the network to a configuration through three real-valued vectors representing words, POS tags, and dependency relations. The vectors that represent POS tags and dependency relations are initialized randomly, but those that represent words are initialized by word vectors systematically extracted from a corpus. Each of these vectors is independently connected to the hidden layer of the network through one of three distinct weight matrices. A cube activation function is used in the hidden layer to model the interactions between the elements of the vectors. The activation function resembles a third-degree polynomial kernel that enables the network to take different combinations of vector elements into consideration. The output layer generates probabilities for decisions between different actions in the arc-standard system. The network is trained by the standard back-propagation algorithm, which updates both the network weights and the vectors used in the input layer.
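The forward pass of such a network can be sketched with NumPy as follows. The function name, the layer sizes, and the random weights are illustrative assumptions, not values from the parser; the sketch only shows the three separate weight matrices, the cube activation, and a softmax output over the three unlabelled transitions.

```python
import numpy as np


def parser_forward(x_word, x_pos, x_label, W_word, W_pos, W_label, bias, W_out):
    # Hidden layer: three separate weight matrices, then a cube activation.
    hidden = (W_word @ x_word + W_pos @ x_pos + W_label @ x_label + bias) ** 3
    # Output layer: softmax over the transitions (shifted for numerical stability).
    scores = W_out @ hidden
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
# Illustrative sizes only (assumptions of this sketch).
d_word, d_pos, d_label, n_hidden, n_actions = 50, 20, 20, 200, 3
probs = parser_forward(
    rng.normal(size=d_word),
    rng.normal(size=d_pos),
    rng.normal(size=d_label),
    rng.normal(scale=0.01, size=(n_hidden, d_word)),
    rng.normal(scale=0.01, size=(n_hidden, d_pos)),
    rng.normal(scale=0.01, size=(n_hidden, d_label)),
    np.zeros(n_hidden),
    rng.normal(scale=0.01, size=(n_actions, n_hidden)),
)
```

Expanding the cube of a sum yields products of up to three input elements, which is why the activation behaves like a third-degree polynomial kernel.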

4 Word Vectors

Dense vector representations of words, in this paper known as word vectors, have shown great improvements in natural language processing tasks. An advantage of this representation compared to the traditional one-hot representation is that the word vectors are enriched with information about the distribution of words in different contexts.

Following Lebret and Collobert (2014), we propose to extract word vectors from a co-occurrence matrix as follows. First, we build a co-occurrence matrix C from a text. The element $C_{i,j}$ is a maximum likelihood estimate of the probability of seeing word $w_j$ in the context of word $w_i$, i.e., $C_{i,j} = p(w_j \mid w_i)$. This results in a sparse matrix whose data are massed around zero because of the disproportionate contribution of high-frequency words to the estimated co-occurrence probabilities. Each column of C can be seen as a vector in a high-dimensional space whose dimensions correspond to the context words. In practice, we need to reduce the dimensionality of these vectors. This can be done by standard methods of dimensionality reduction such as principal component analysis. However, the high density of data around zero and the presence of a small number of data points far from the data mass can lead to some meaningless discrepancies between word vectors.
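A probability co-occurrence matrix of this kind can be estimated from raw counts, as in the sketch below. Two choices are assumptions of this sketch: the only context of a word is the word immediately preceding it, and words are stored as columns (rows index context words) so that each column is a word's vector. The function name and the toy sentences are hypothetical.

```python
import numpy as np


def cooccurrence_matrix(sentences, vocab, contexts):
    w_index = {w: i for i, w in enumerate(vocab)}
    c_index = {w: j for j, w in enumerate(contexts)}
    counts = np.zeros((len(contexts), len(vocab)))
    for sentence in sentences:
        # Pair every word with the word immediately before it.
        for previous, word in zip(sentence, sentence[1:]):
            if word in w_index and previous in c_index:
                counts[c_index[previous], w_index[word]] += 1
    # Maximum likelihood estimate: normalize each word's column of counts.
    totals = counts.sum(axis=0, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)


sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
vocab = ["cat", "dog", "sat"]
contexts = ["the", "cat", "dog"]
C = cooccurrence_matrix(sentences, vocab, contexts)
```

With a realistic vocabulary most entries of such a matrix are zero or very small, which is the concentration around zero described above.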

In order to have a better representation of the data, we expand the probability values in C by skewing the data mass from zero toward one. This can be done by any monotonically increasing concave function that magnifies small numbers in its domain while preserving the given order. Commonly used transformation functions with these characteristics are the logarithm function, the hyperbolic tangent, the power transformation, and the Box-Cox transformation with some specific parameters. Fig. 1 shows how a transformation function expands high-dimensional word vectors massed around zero.

After applying f on C, we centre the column vectors in f(C) around their mean and build the word vectors as below:

$$\Upsilon = \gamma V^{T}_{n,k} \qquad (1)$$

where Υ is a matrix of k-dimensional word vectors associated with n words, $V^{T}_{n,k}$ is the top k right singular vectors of f(C), and $\gamma = \lambda\sqrt{n}$ is a constant factor to scale the unbounded data in the word vectors. In the following, we will refer to our model as RSV, standing for Right Singular word Vector or Real-valued Syntactic word Vectors as it is mentioned in the title of this paper.

Figure 1: PCA visualization of the high-dimensional column-vectors of a co-occurrence matrix (a) before and (b) after applying the transformation function $\sqrt[10]{x}$. [Two scatter plots whose points are labelled with frequent corpus words.]
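The extraction step of Eq. 1 can be sketched with a dense SVD in NumPy. This is a minimal illustration under stated assumptions: the function name is hypothetical, words are stored as columns of C, centring subtracts the mean column vector, the default transformation is the tenth root used in Figure 1, and a full `np.linalg.svd` stands in for whatever decomposition is practical on a large matrix.

```python
import numpy as np


def rsv_vectors(C, k, lam=0.1, transform=lambda x: x ** 0.1):
    F = transform(np.asarray(C, dtype=float))   # f(C)
    F = F - F.mean(axis=1, keepdims=True)       # centre the column vectors around their mean
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    n = F.shape[1]                              # number of words (columns)
    return lam * np.sqrt(n) * Vt[:k]            # gamma * V^T_{n,k}, shape (k, n)


rng = np.random.default_rng(0)
C_demo = rng.random((20, 8))        # toy matrix: 20 context words, 8 words
Y = rsv_vectors(C_demo, k=3)        # column j of Y is the vector of word j
```

Because the rows of `Vt` are orthonormal, the k rows of the result are mutually orthogonal and each has squared norm λ²n, so the scaling factor fixes the overall variance of the word vectors.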

5 Experimental Setting

We restrict our experiments to three languages: English, Swedish, and Persian. Our experiments on English are organized as follows. Using different types of transformation functions, we first extract a set of word vectors that gives the best parsing accuracy on our development set. Then we study the effect of dimensionality on parsing performance. Finally, we compare our best results with the results obtained from other sets of word vectors generated with popular methods of word embedding. Using the best transformation function obtained for English, we extract word vectors for Swedish and Persian. These word vectors are then used to train parsing models on the corpus of universal dependencies.

The English word vectors are extracted from a corpus consisting of raw sentences in the Wall Street Journal (WSJ) (Marcus et al., 1993), the English Wikicorpus,¹ the Thomson Reuters Text Research Collection (TRC2), the English Wikipedia corpus,² and the Linguistic Data Consortium (LDC) corpus. We concatenate all the corpora and split the sentences with the OpenNLP sentence splitting tool. The Stanford tokenizer is used for tokenization. Word vectors for Persian are extracted from the Hamshahri Corpus (AleAhmad et al., 2009), the Tehran Monolingual Corpus,³ and the Farsi Wikipedia downloaded from Wikipedia Monolingual Corpora.⁴ The Persian text normalizer tool (Seraji, 2015) is used for sentence splitting and tokenization.⁵ Word vectors for Swedish are extracted from the Swedish Wikipedia available at Wikipedia Monolingual Corpora, Swedish web news corpora (2001-2013), and the Swedish Wikipedia corpus collected by Språkbanken.⁶ The OpenNLP sentence splitter and tokenizer are used for normalizing the corpora.

¹ http://www.cs.upc.edu/~nlp/wikicorpus/

We replace all numbers with a special token NUMBER and convert uppercase letters to lowercase forms in English and Swedish. Word vectors are extracted only for the unique words appearing at least 100 times. We choose the cut-off word frequency of 100 because it is commonly used as a standard threshold in other references. The 10 000 most frequent words are used as context words in the co-occurrence matrix. Table 1 presents some statistics of the corpora.

           #Tokens   #W≥1         #W≥100    #Sents
English    8×10^9    14 462 600   404 427   4×10^8
Persian    4×10^8    1 926 233    60 718    1×10^7
Swedish    6×10^8    5 437 176    174 538   5×10^7

Table 1: Size of the corpora from which word vectors are extracted; #Tokens: total number of tokens; #W≥k: number of unique words appearing at least k times in the corpora; #Sents: number of sentences.
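The normalization and vocabulary selection described above can be sketched as follows. The function names are hypothetical, and replacing every run of digits with NUMBER is an assumption of this sketch, chosen because it reproduces tokens such as NUMBER.NUMBER and NUMBERth; the text does not spell out the exact rule.

```python
import re
from collections import Counter


def normalize(token):
    # Lowercase first so that the inserted NUMBER token stays uppercase.
    return re.sub(r"\d+", "NUMBER", token.lower())


def build_vocabularies(sentences, min_count=100, n_contexts=10_000):
    counts = Counter(normalize(token) for sentence in sentences for token in sentence)
    # Words that receive a vector: those seen at least min_count times.
    vocab = [word for word, count in counts.items() if count >= min_count]
    # Context words: the n_contexts most frequent words.
    contexts = [word for word, _ in counts.most_common(n_contexts)]
    return vocab, contexts


# Toy call with small thresholds so the effect is visible.
toy_vocab, toy_contexts = build_vocabularies([["a", "A", "b"]], min_count=2, n_contexts=1)
```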

The word vectors are evaluated with respect to the accuracy of parsing models trained with them using the Stanford neural dependency parser (Chen and Manning, 2014). The English parsing models are trained and evaluated on the corpus of universal dependencies (Nivre et al., 2016) version 1.2 (UD) and on the Wall Street Journal (WSJ) (Marcus et al., 1993) annotated with Stanford typed dependencies (SD) (De Marneffe and Manning, 2010) and CoNLL syntactic dependencies (CD) (Johansson and Nugues, 2007). We split WSJ as follows: sections 02–21 for training, section 22 for development, and section 23 as the test set. The Stanford conversion tool (De Marneffe et al., 2006) and the LTH conversion tool⁷ are used for converting constituency trees in WSJ to SD and CD. The Swedish and Persian parsing models are trained on the corpus of universal dependencies. All the parsing models are trained with gold POS tags unless we explicitly mention that predicted POS tags are used.

² https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml
³ http://ece.ut.ac.ir/system/files/NLP/Resources/
⁴ http://linguatools.org/tools/corpora/wikipedia-monolingual-corpora
⁵ http://stp.lingfil.uu.se/~mojgan
⁶ https://spraakbanken.gu.se/eng/resources/corpus

6 Results

In the following, we study how word vectors generated by RSV influence parsing performance. RSV has four main tuning parameters: 1) the context window, 2) the transformation function f, 3) the parameter λ used in the normalization step, and 4) the number of dimensions. The context window can be symmetric or asymmetric, with different lengths. We choose the asymmetric context window of length 1, i.e., the first preceding word, as suggested by Lebret and Collobert (2015) for syntactic tasks. λ is a task-dependent parameter that controls the variance of the word vectors. In order to find the best value of λ, we train the parser with different sets of word vectors generated randomly by Gaussian distributions with zero mean and isotropic covariance matrices λI with values of λ ∈ (0, 1]. Fig. 2a shows the parsing accuracies obtained from these word vectors. The best results are obtained from word vectors generated with λ = 0.1 and λ = 0.01. The variation in the results shows how strongly the variance of the word vectors affects the accuracy of the parser. Accordingly, we set the normalization parameter λ in Eq. 1 equal to 0.1. The two remaining parameters are explored in the following subsections.
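The random baseline vectors used to choose λ can be generated as in this short sketch. The function name and the sizes are hypothetical; the only point it illustrates is that an isotropic covariance λI corresponds to a per-component standard deviation of √λ.

```python
import numpy as np


def random_word_vectors(n_words, k, lam, rng):
    # Zero mean, covariance lam * I: each component has standard deviation sqrt(lam).
    return rng.normal(loc=0.0, scale=np.sqrt(lam), size=(n_words, k))


rng = np.random.default_rng(0)
baseline = random_word_vectors(n_words=4000, k=50, lam=0.1, rng=rng)
```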

6.1 Transformation Function

We have tested four sets of transformation functions on the co-occurrence matrix:

• $f_1 = \tanh(nx)$
• $f_2 = \sqrt[n]{x}$
• $f_3 = n(\sqrt[n]{x} - 1)$
• $f_4 = \frac{\log(2^{n+1}x + 1)}{\log(2^{n+1} + 1)}$

where x ∈ [0, 1] is an element of the matrix and n is a natural number that controls the degree of skewness: the higher the value of n is, the more the data will be skewed. Fig. 2b shows the effect of using these transformation functions on parsing accuracy. Best results are obtained from nth-root and Box-Cox transformation functions with n = 7, which are closely related to each other. Denoting the set of word vectors obtained from the transformation function f as Υ(f), it can be shown that $\Upsilon(f_3) = \Upsilon(f_2) - n$, since the effect of coefficient n in the first term of $f_3$ is cancelled out by the right singular vectors in Eq. 1.

⁷ http://nlp.cs.lth.se/software/treebank-converter/
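The four function families can be written directly in NumPy, which also makes the close relation between the nth root and the Box-Cox form easy to check: f3 is f2 scaled by n and shifted by −n. The function names follow the list above; the evaluation grid is an arbitrary choice of this sketch.

```python
import numpy as np


def f1(x, n):
    return np.tanh(n * x)


def f2(x, n):
    return x ** (1.0 / n)                       # n-th root


def f3(x, n):
    return n * (x ** (1.0 / n) - 1.0)           # Box-Cox form of the n-th root


def f4(x, n):
    return np.log(2.0 ** (n + 1) * x + 1.0) / np.log(2.0 ** (n + 1) + 1.0)


x = np.linspace(0.0, 1.0, 101)
n = 7
gap = np.max(np.abs(f3(x, n) - (n * f2(x, n) - n)))   # zero up to rounding error
```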

Fig. 2c visualizes the best transformation function in each of the function sets. All the functions share the property of having relatively high derivatives around 0, which allows skewing the data in the co-occurrence matrix to the right (i.e., close to one) and making a clear gap between the syntactic structures that can happen (i.e., non-zero probabilities in the co-occurrence matrix) and those that cannot happen (i.e., zero probabilities in the co-occurrence matrix). This movement from zero to one, however, can lead to disproportionate expansions between the data close to zero and those that are close to one. This is because the limit of the ratio of the derivative of $f_i(x)$, $i = 1, 2, 3, 4$, to the derivative of $f_i(y)$ as x approaches 0 and y approaches 1 is infinity, i.e., $\lim_{x \to 0,\, y \to 1} \frac{f_i'(x)}{f_i'(y)} = \infty$. The almost uniform behaviour of $f_1(x)$ for x > 0.4 results in a small variance in the generated data that will be ignored by the subsequent singular value decomposition step, which consequently loses the information provided by the most probable context words. Our solution to these problems is to use the following piecewise transformation function f:

$$f = \begin{cases} \tanh(7x) & x \le \theta \\ \sqrt[7]{x} & x > \theta \end{cases} \qquad (2)$$

where $\theta = 10^{-n}$ and $n \in \mathbb{N}$. This function expands the data in a more controlled manner (i.e., $\lim_{x \to 0,\, y \to 1} \frac{f'(x)}{f'(y)} = 49$) with less loss of the information provided by the variance of the data. Using this function with $\theta = 10^{-7}$, we obtain a UAS of 92.3 and a LAS of 90.9 on the WSJ development set annotated with Stanford typed dependencies, which is slightly better than the other transformation functions (see Fig. 2b). We obtain a UAS of 92.0 and a LAS of 90.6 on the WSJ test set with the same setting.
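The stated limit of 49 can be checked by differentiating the two branches of the piecewise function separately. The derivation below assumes that the second branch of Eq. 2 is the seventh root, which is the reading under which the stated value holds; the last line shows, for contrast, why the plain nth root has an unbounded ratio.

```latex
\begin{align*}
f'(x) &= 7\,\operatorname{sech}^2(7x) \;\longrightarrow\; 7
      && \text{as } x \to 0 \quad (x \le \theta),\\
f'(y) &= \tfrac{1}{7}\, y^{-6/7} \;\longrightarrow\; \tfrac{1}{7}
      && \text{as } y \to 1 \quad (y > \theta),\\
\lim_{x \to 0,\; y \to 1} \frac{f'(x)}{f'(y)} &= \frac{7}{1/7} = 49,\\
f_2'(x) &= \tfrac{1}{n}\, x^{(1-n)/n} \;\longrightarrow\; \infty
      && \text{as } x \to 0 \quad (n > 1).
\end{align*}
```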

6.2 Dimensionality

Figure 2: Unlabelled attachment score of parsing models trained with (a) the randomly generated word vectors, and (b) the systematically extracted word vectors using different transformation functions. The experiments are carried out with 50-dimensional word vectors. Parsing models are evaluated on the development set in Wall Street Journal annotated with Stanford typed dependencies (SD). f in (b) is the piecewise transformation function shown in Eq. 2. (c) The transformation functions in each function set resulting in the best parsing models. The horizontal axis shows the data in a probability co-occurrence matrix and the vertical axis shows their projection after transformation. For better visualization, the range of $f_3$ is scaled to [0, 1].

The dimensionality of word vectors is determined by the number of singular vectors used in Eq. 1. High-dimensional word vectors are expected to result in higher parsing accuracies. This is because they can capture more information from the original data, i.e., the Frobenius norm of the difference between the original matrix and its truncated estimation depends on the number of top singular vectors used for constructing the truncated matrix. This gain, however, comes at the cost of more computational resources a) to extract the word vectors, and b) to process the word vectors in the parser. The most expensive step in extracting the word vectors is the singular value decomposition of the transformed co-occurrence matrix. Using the randomized SVD method described by Tropp et al. (2009), the extraction of the k top singular vectors of an m × n matrix requires O(mn log(k)) floating point operations. This shows that the cost of having larger word vectors grows logarithmically with the number of dimensions.
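The dependence of the reconstruction error on the number of retained singular vectors can be illustrated on a small random matrix. The sketch uses a dense `np.linalg.svd` rather than a randomized method, and the function name and matrix size are arbitrary assumptions.

```python
import numpy as np


def truncation_error(M, k):
    # Frobenius norm of the difference between M and its rank-k truncated SVD.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return np.linalg.norm(M - M_k, "fro")


rng = np.random.default_rng(0)
M = rng.random((30, 12))
errors = [truncation_error(M, k) for k in (1, 3, 6, 12)]   # shrinks as k grows

# The error equals the root of the sum of the squared discarded singular values.
singular_values = np.linalg.svd(M, compute_uv=False)
tail = np.sqrt(np.sum(singular_values[3:] ** 2))
```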

The parsing performance is affected by the dimensionality of the word vectors fed into the input layer of the neural network in two ways. First, a higher number of dimensions in the input layer leads to a larger weight matrix between the input layer and the hidden layer. Second, a larger hidden layer is needed to capture the dependencies between the elements in the input layer. Given a set of word vectors with k dimensions connected to a hidden layer with h hidden units, the weight matrix between the input layer and the hidden layer grows on the scale of O(kh), and the weight matrix between the hidden layer and the output layer grows on the scale of O(h). For each input vector, the back-propagation algorithm passes over the weight matrices three times per iteration: 1) to forward the input vector through the network, 2) to back-propagate the errors generated by the input, and 3) to update the network parameters. So, each input vector needs O(3(kh + h)) time to be processed by the algorithm. Given the trained model, the output signals are generated through only one forward pass.

Table 2 shows how high-dimensional word vectors affect the parsing performance. In general, increasing the number of hidden units leads to a more accurate parsing model at a linear cost in parsing speed. Increasing the dimensionality of word vectors up to 200 dimensions consistently increases the parsing accuracy, again at a linear cost in parsing speed. However, increasing both the dimensionality of word vectors and the size of the hidden layer leads to a quadratic decrease in parsing speed. The best results are obtained from the parsing model trained with 100-dimensional word vectors and 400 hidden units, resulting in a parsing accuracy of 93.0 UAS and 91.7 LAS on our test set, a +1.0 UAS and +1.1 LAS improvement over what we obtained with 50-dimensional word vectors. It is obtained at the cost of a 47% reduction in parsing speed.
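The figures in this paragraph can be reproduced from numbers reported elsewhere in the paper. The gains use the test-set scores of 92.0 UAS and 90.6 LAS reported in Sec. 6.1 for 50-dimensional vectors. For the speed reduction, the calculation assumes that the baseline is the Table 2 model with k = 50 and h = 200 (392 sentences/second) and the best model is the one with k = 100 and h = 400 (206 sentences/second); the text does not name the baseline explicitly.

```latex
\begin{align*}
\Delta\mathrm{UAS} &= 93.0 - 92.0 = 1.0, &
\Delta\mathrm{LAS} &= 91.7 - 90.6 = 1.1,\\
\text{speed reduction} &= \frac{392 - 206}{392} \approx 0.47 .
\end{align*}
```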

6.3 Comparison and Consistency

          h = 200             h = 300             h = 400
k      UAS   LAS   P       UAS   LAS   P       UAS   LAS   P
50     92.3  90.9  392     92.9  91.5  307     93.0  91.6  237
100    92.6  91.2  365     92.9  91.5  263     93.1  91.8  206
150    92.6  91.2  321     92.9  91.5  236     93.1  91.8  186
200    92.7  91.3  310     93.1  91.7  212     93.1  91.8  165
250    92.7  91.2  286     92.9  91.5  201     93.0  91.7  146
300    92.6  91.2  265     92.9  91.6  180     92.9  91.5  119
350    92.7  91.2  238     92.8  91.4  174     92.9  91.5  111
400    92.6  91.2  235     92.8  91.3  141     93.0  91.5  97

Table 2: The performance of parsing models trained with k-dimensional word vectors and h hidden units. The parsing accuracies (UAS, LAS) are for the development set. P: parsing speed (sentences/second).

We evaluate the RSV word vectors on different types of dependency representations and different languages. Table 3 gives a comparison between RSV and different methods of word embedding with respect to their contributions to dependency parsing and the time required to generate word vectors for English. All the parsing models are trained with 400 hidden units and 100-dimensional word vectors extracted from the English raw corpus described in Sec. 5. The word vectors are extracted on a Linux machine running on 12 CPU cores. The free parameters of the word embedding methods, i.e., context type and context size, have been tuned on the development set, and the best settings, resulting in the highest parsing accuracies, were then chosen for comparison. This leads us to an asymmetric window of length 1 for RSV, GloVe, and HPCA, and a symmetric window of length 1 for the word2vec models, CBOW and SkipGram. The GloVe word vectors are extracted by the available implementation of GloVe (Pennington et al., 2014) running for 50 iterations. The HPCA word vectors are extracted by our implementation of the method (Lebret and Collobert, 2014). CBOW and SkipGram word vectors are extracted by the available implementation of word2vec (Mikolov et al., 2013) running for 10 iterations with a negative sampling value of 5. In order to show the validity of the results, we perform a bootstrap statistical significance test (Berg-Kirkpatrick et al., 2012) on the results obtained from each parsing experiment and RSV with the null hypothesis H0: RSV is no better than the model B, where B can be any of the word embedding methods. The resulting p-values are reported together with the parsing accuracies.
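A paired bootstrap test of this kind can be sketched as follows. The sketch follows the general recipe of resampling the evaluation items with replacement and counting how often the resampled gain exceeds twice the observed gain; the function name, the per-token 0/1 correctness inputs, and the number of samples are assumptions, since the text does not state the resampling unit or sample count used in the experiments.

```python
import numpy as np


def paired_bootstrap_p(correct_a, correct_b, n_samples=10_000, seed=0):
    # correct_a / correct_b: 1 if the system attached the token correctly, else 0,
    # for the same tokens in the same order.
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    diff = a - b
    observed = diff.mean()                      # observed gain of A over B
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_samples, len(diff)))
    resampled = diff[idx].mean(axis=1)          # gain on each bootstrap sample
    # Estimated p-value for the null hypothesis "A is no better than B".
    return float(np.mean(resampled > 2 * observed))


# Toy inputs: system A is right on 150 tokens where B is wrong, and wrong on 50 where B is right.
toy_a = [1] * 150 + [0] * 50
toy_b = [0] * 150 + [1] * 50
p_value = paired_bootstrap_p(toy_a, toy_b, n_samples=1000)
```

A small p-value rejects the null hypothesis; with only a few evaluation items or nearly identical systems the estimate becomes unreliable.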

The empirical results show that HPCA, RSV, and GloVe are, in that order, the fastest methods of word embedding. These methods are faster than word2vec because they scan the corpus only once and then store it as a co-occurrence matrix in memory. HPCA is faster than RSV because HPCA stores the co-occurrence matrix as a sparse matrix, whereas RSV stores it as a full matrix. In return for this expense, RSV yields better word vectors than HPCA when they are used in the task of dependency parsing.

| Model | Time | SD UAS | SD LAS | CD UAS | CD LAS | UD UAS | UD LAS |
|-------|------|--------|--------|--------|--------|--------|--------|
| CBOW  | 8741 | 93.0 | 91.5 | 93.4 | 92.6 | 88.0 | 85.4 |
| p-val |      | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 |
| SGram | 11113 | 93.0 | 91.6 | 93.4 | 92.5 | 87.4 | 84.9 |
| p-val |      | 0.06 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 |
| GloVe | 3150 | 92.9 | 91.6 | 93.5 | 92.6 | 88.4 | 85.8 |
| p-val |      | 0.02 | 0.02 | 0.04 | 0.06 | 0.54 | 0.38 |
| HPCA  | 2749 | 92.1 | 90.8 | 92.5 | 91.7 | 86.6 | 84.0 |
| p-val |      | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| RSV   | 2859 | 93.1 | 91.8 | 93.6 | 92.8 | 88.4 | 85.9 |

Table 3: Performance of word embedding methods. The quality of word vectors is measured with respect to the parsing models trained with them. The efficiency of the methods is measured by the time (in seconds) required to extract a set of word vectors. Parsing models are evaluated on our English development set. SGram: SkipGram; SD: Stanford typed Dependencies; CD: CoNLL Dependencies; UD: Universal Dependencies; p-val: p-value of the null hypothesis that RSV is no better than the word embedding method corresponding to each cell of the table.

The results obtained from RSV word vectors are comparable to, and slightly better than, those of the other sets of word vectors. The difference between RSV and the other methods is clearer when one looks at the labelled attachment scores. Apart from the parsing experiment with GloVe on the universal dependencies, the relatively small p-values reject our null hypothesis and confirm that RSV can result in higher-quality word vectors for the task of dependency parsing. In addition, the consistent superiority of the results obtained from RSV across dependency styles is evidence that the results are statistically significant, i.e., that the advantage of RSV is not merely due to chance. Among the methods of word embedding, the results obtained from GloVe are closest to those of RSV, especially on universal dependencies. We show in Sec. 8 how these methods are connected to each other.
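This section does not say which test produced the p-values, so the following is only a sketch of one common choice, a paired sign-flip randomization test for the null hypothesis that system A is no better than system B. The function name, the per-token 0/1 scores, and the number of rounds are illustrative assumptions.

```python
import random


def randomization_p_value(scores_a, scores_b, rounds=2000, seed=0):
    """One-sided p-value for H0: system A is no better than system B.

    scores_a and scores_b are paired per-item scores (e.g. 1 if a token
    received the correct head and label, 0 otherwise).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs)
    rng = random.Random(seed)
    at_least = 0
    for _ in range(rounds):
        # Under H0 the paired outputs are exchangeable, so flip each sign at random.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if shuffled >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least + 1) / (rounds + 1)


same = randomization_p_value([1, 0, 1, 1], [1, 0, 1, 1])
better = randomization_p_value([1] * 30, [0] * 30)
print(same, better)
```

Two identical systems give a p-value of 1, while a system that is correct on every item where the other fails gives a p-value close to 0, mirroring how the small values in Table 3 are read.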

Table 4 shows the results obtained from the Stanford parser trained with RSV vectors and two other greedy transition-based dependency parsers, MaltParser (Nivre et al., 2006) and Parsito (Straka et al., 2015). All the parsing models are trained with the arc-standard system on the corpus of universal dependencies. Par-St and Par-Sr refer to the results reported for Parsito trained with a static oracle and a search-based oracle, respectively. As shown, in all cases the parsing models trained with the Stanford parser and RSV (St-RSV) are more accurate than the other parsing models. The superiority of St-RSV over Par-Sr shows the importance of word vectors in dependency parsing in comparison with adding more features to the parser or using a search-based oracle.

| Language | Metric | Par-St | Par-Sr | Malt | St-RSV |
|----------|--------|--------|--------|------|--------|
| English  | UAS | 86.7 | 87.4 | 86.3 | 87.6 |
|          | LAS | 84.2 | 84.7 | 82.9 | 84.9 |
| Persian  | UAS | 83.8 | 84.5 | 80.8 | 85.4 |
|          | LAS | 80.2 | 81.1 | 77.2 | 82.4 |
| Swedish  | UAS | 85.3 | 85.9 | 84.7 | 86.2 |
|          | LAS | 81.4 | 82.3 | 80.3 | 82.5 |

Table 4: Accuracy of dependency parsing. Par-St and Par-Sr refer to the Parsito models trained with a static oracle and a search-based oracle. St-RSV refers to the Stanford parser trained with RSV vectors.
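For readers less familiar with the two scores reported in these tables, the following sketch computes unlabelled and labelled attachment scores (UAS and LAS) under their standard definitions: the share of tokens with the correct head, and the share with both the correct head and the correct label. The function name and the toy sentence are illustrative assumptions, and details such as punctuation handling in the actual evaluation are not covered.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for parallel lists of (head, label) pairs."""
    assert len(gold) == len(predicted) and gold
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


# Hypothetical four-token sentence: one wrong head, one wrong label.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "iobj")]
uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Because a wrong label with a correct head counts for UAS but not for LAS, LAS can never exceed UAS, which matches the pattern in every row of the tables.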

The results obtained from the Stanford parser and UDPipe (Straka et al., 2016) are summarized in Table 5. The results are reported for both predicted and gold POS tags. UDPipe sacrifices the greedy nature of Parsito by adding a beam search decoder to it. In general, one can argue that UDPipe adds the following items to the Stanford parser: 1) a search-based oracle, 2) a set of morphological features, and 3) a beam search decoder. The nearly identical results obtained from both parsers for English show that a set of informative word vectors can be as influential as the three extra items added by UDPipe. However, the higher accuracies obtained from UDPipe for Swedish and Persian, together with the fact that the training data for these languages are considerably smaller than for English, show the importance of these extra items for parsing accuracy when not enough training data is available.

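| Language | Metric | UDPipe (predicted tags) | St-RSV (predicted tags) | UDPipe (gold tags) | St-RSV (gold tags) |
|----------|--------|-------------------------|-------------------------|--------------------|--------------------|
| English  | UAS | 84.0 | 84.6 | 87.5 | 87.6 |
|          | LAS | 80.2 | 80.9 | 85.0 | 84.9 |
| Swedish  | UAS | 81.2 | 80.4 | 86.2 | 86.2 |
|          | LAS | 77.0 | 76.6 | 83.2 | 82.5 |
| Persian  | UAS | 84.1 | 82.4 | 86.3 | 85.4 |
|          | LAS | 79.7 | 78.1 | 83.0 | 82.4 |

Table 5: Accuracy of dependency parsing on the corpus of universal dependencies. St-RSV refers to the Stanford parser trained with RSV vectors.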
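The difference between greedy decoding and the beam search decoder mentioned for UDPipe can be illustrated with a toy example. This is only a sketch: the action names, the integer scores, and the `beam_decode` helper are invented for illustration and do not reflect the UDPipe or Parsito implementations, where transitions are scored by a neural classifier over parser configurations.

```python
def beam_decode(score, actions, steps, beam_size):
    """Keep the `beam_size` best partial action sequences at every step."""
    beam = [((), 0)]
    for _ in range(steps):
        candidates = [
            (seq + (a,), total + score(seq, a))
            for seq, total in beam
            for a in actions
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]


def toy_score(prefix, action):
    """Hypothetical scores: 'shift' looks best first but leads to a poor second step."""
    table = {
        ((), "shift"): 6, ((), "left"): 4,
        (("shift",), "shift"): 1, (("shift",), "left"): 2,
        (("left",), "shift"): 9, (("left",), "left"): 3,
    }
    return table[(prefix, action)]


# A beam of size 1 is exactly greedy decoding.
greedy = beam_decode(toy_score, ["shift", "left"], steps=2, beam_size=1)
beam = beam_decode(toy_score, ["shift", "left"], steps=2, beam_size=2)
print(greedy, beam)
```

With a beam of size 1 the decoder commits to the locally best first action and ends with a lower total score, whereas a beam of size 2 keeps the alternative alive and recovers the better sequence, at the cost of scoring more candidates per step.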

7 Nature of Dimensions

In this section, we study the nature of the dimensions formed by RSV. Starting from the high-dimensional space formed by the transformed co-occurrence matrix $f(C)$, word similarities can be measured by a similarity matrix $K = f(C^T) f(C)$ whose leading eigenvectors, corresponding to the leading right singular vectors of $f(C)$, form the RSV word vectors. This suggests that RSV dimensions measure a typical kind of word similarity on the basis of the variability of words' contexts, since the eigenvectors of $K$ account for the directions of largest variance in the word vectors defined by $f(C)$.
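The correspondence between the eigenvectors of the similarity matrix and the right singular vectors of the transformed matrix can be checked numerically. The sketch below uses a small random matrix as a stand-in for f(C) (an assumption; no real co-occurrence data is involved) and treats f(C^T) as the transpose of f(C), which holds when f is applied element-wise.

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.random((8, 5))                  # hypothetical f(C); columns stand for words

# Right singular vectors of f(C) are the rows of Vt (descending singular values).
_, s, Vt = np.linalg.svd(F, full_matrices=False)

# Eigen-decomposition of the similarity matrix K = f(C)^T f(C).
K = F.T @ F
eigvals, eigvecs = np.linalg.eigh(K)    # eigh returns ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Eigenvalues of K equal the squared singular values of f(C).
same_spectrum = np.allclose(eigvals, s ** 2)

# Each eigenvector matches the corresponding right singular vector up to sign.
cosines = np.abs(np.sum(eigvecs * Vt.T, axis=0))
same_directions = np.allclose(cosines, 1.0)
print(same_spectrum, same_directions)
```

The sign ambiguity is expected: both decompositions determine each direction only up to a factor of -1, which is why the comparison uses absolute cosines.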

To assess the validity of this statement, we study the dimensions individually. For each dimension, we first project all unit-sized word vectors onto it and then sort the resulting data in ascending order to see if any syntactic or semantic regularities can be seen. Table 6 shows the 10 words appearing at the head of the ordered lists for the first 10 dimensions. The dimensions are indexed according to their related singular vectors. The table shows that, to some extent, the dimensions match syntactic and semantic word categories discussed in linguistics. There is a direct relation between the indices and the variability of words' contexts. The regularities between the words appearing in highly variable contexts, mostly the high-frequency words, are captured by the leading dimensions.
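The inspection procedure described above (normalise, project, sort in ascending order, read off the head of the list) can be sketched as follows. The two-dimensional toy vectors and the helper names are hypothetical, and the sketch assumes that projecting onto a dimension simply means taking that coordinate of the unit-length vector.

```python
import math

# Hypothetical toy vectors; real RSV vectors would come from the SVD step.
vectors = {
    "the":   [-3.0, 0.5],
    "but":   [-2.0, 1.0],
    "reid":  [0.5, -2.0],
    "evans": [1.0, -3.0],
    "love":  [2.0, 2.0],
}


def unit(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def head_of_dimension(vectors, dim, top=3):
    """Words with the lowest coordinate on `dim` after unit-normalisation."""
    scored = [(unit(v)[dim], w) for w, v in vectors.items()]
    return [w for _, w in sorted(scored)[:top]]


dim0 = head_of_dimension(vectors, 0)
dim1 = head_of_dimension(vectors, 1)
print(dim0, dim1)
```

In this toy setting the first dimension ranks the function words first and the second ranks the surnames first, which is the kind of regularity one looks for when reading a table of per-dimension word lists.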

To a large extent, the first dimension accounts for the degree of variability of words' contexts. Lower numbers are given to words that appear in highly flexible contexts (i.e., high-frequency words such as as, but, in and ...). Dimensions 2–5 are informative for determining syntactic categories such as adjectives, proper nouns, function words, and verbs. Dimensions 3 and 8 give lower numbers to proper names (interestingly, mostly last names in 3 and male first names in 8). Some kind of semantic regularity can also be seen in most dimensions. For example, adjectives in dimension 2 are mostly connected with society, nouns in dimension 6 denote humans, nouns in dimension 7 are abstract, and words in dimension 9 are mostly connected to affective emotions.

| Dim | Top 10 words |
|-----|--------------|
| 1 | . – , is as but in ... so |
| 2 | domestic religious civilian russian physical social iraqi japanese mexican scientific |
| 3 | mitchell reid evans allen lawrence palmer duncan russell bennett owen |
| 4 | . but in – the and or as at , |
| 5 | 's believes thinks asks wants replied tries says v. agrees |
| 6 | citizens politicians officials deputy businessmen lawmakers former elected lawyers politician |
| 7 | cooperation policy reforms policies funding reform approval compliance oversight assistance |
| 8 | geoff ron doug erik brendan kurt jeremy brad ronnie yuri |
| 9 | love feeling sense answer manner desire romantic emotional but ... |
| 10 | have were are but – . and may will |

Table 6: Top 10 words projected on the top 10 dimensions

8 Related Work on Word Vectors

The two dominant approaches to creating word vectors (or word embeddings) are: 1) incremental methods that update a set of randomly initialized word vectors while scanning a corpus (Mikolov et al., 2013; Collobert et al., 2011), and 2) batch methods that extract a set of word vectors from a co-occurrence matrix (Pennington et al., 2014; Lebret and Collobert, 2014). Pennington et al. (2014) show that both approaches are closely related to each other. Here, we elaborate on the connections between RSV and HPCA (Lebret and Collobert, 2014) and GloVe (Pennington et al., 2014).

HPCA performs a Hellinger transformation followed by principal component analysis on the co-occurrence matrix $C$ as below:

$$Y = S V^T \qquad (3)$$

where $Y$ is the matrix of word vectors, and $S$ and $V$ are the matrices of the top singular values and right singular vectors of $\sqrt[2]{C}$. Since the word vectors are to be used by a neural network, Lebret and Collobert (2014) recommend normalizing them to avoid the saturation problem in the network weights (LeCun et al., 2012). Denoting $\tilde{Y}$ as the empirical mean of the column vectors in $Y$ and $\sigma(Y)$ as their standard deviation, Eq. 4, suggested by Lebret and Collobert (2014), normalizes the elements of the word vectors to have zero mean and a fixed standard deviation of $\lambda \le 1$.

$$\Upsilon = \frac{\lambda (Y - \tilde{Y})}{\sigma(Y)} \qquad (4)$$

$\tilde{Y}$ is $0$ if one centres the column vectors in $\sqrt[2]{C}$ around their mean before performing PCA. Substituting Eq. 3 into Eq. 4 and using the facts that $\tilde{Y} = 0$ and $\sigma(Y) = \frac{1}{\sqrt{n-1}} S$, where $n$ is the number of words, we reach Eq. 1.
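The substitution described above can be written out step by step. The following is only a sketch: it treats the standard deviation term as the diagonal matrix $\frac{1}{\sqrt{n-1}} S$ and reads the division in Eq. 4 as left-multiplication by its inverse, which is an assumption since the text does not spell out the matrix form.

```latex
\begin{align*}
\Upsilon &= \lambda\,\sigma(Y)^{-1}\,(Y - \tilde{Y}) \\
         &= \lambda \left(\tfrac{1}{\sqrt{n-1}}\, S\right)^{-1} \left(S V^T - 0\right) \\
         &= \lambda \sqrt{n-1}\; S^{-1} S\, V^T \\
         &= \lambda \sqrt{n-1}\; V^T
\end{align*}
```

Under this reading the singular values cancel, so the normalized vectors depend only on the right singular vectors $V$, scaled uniformly by $\lambda\sqrt{n-1}$.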

In general, one can argue that RSV generalises the idea of the Hellinger transformation used in HPCA through a set of more general transformation functions. Other differences between RSV and HPCA are in 1) how they form the co-occurrence matrix $C$, and 2) when they centre the data. For each word $w_i$ and each context word $w_j$, $C_{i,j}$ is $p(w_j|w_i)$ in RSV, but $p(w_i|w_j)$ in HPCA. In RSV, the column vectors of $f(C)$ are centred around their means before performing SVD, but in HPCA, the data are centred after performing PCA. In Sec. 6, we showed that these changes result in a significant improvement in the quality of word vectors.
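The first difference, the direction of the conditional probability stored in the matrix, amounts to normalising the same raw counts along different axes. The sketch below assumes that both probabilities are estimated by relative frequency from a count matrix whose entry (i, j) counts how often word i and context word j co-occur; the function names and the toy counts are illustrative.

```python
def row_conditional(counts):
    """p(w_j | w_i): normalise each row of the count matrix to sum to 1."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


def column_conditional(counts):
    """p(w_i | w_j): normalise each column of the count matrix to sum to 1."""
    col_totals = [sum(col) for col in zip(*counts)]
    return [[c / t if t else 0.0 for c, t in zip(row, col_totals)] for row in counts]


# Hypothetical counts for two words (rows) and two context words (columns).
counts = [[2, 2], [6, 0]]
rsv_style = row_conditional(counts)       # entries play the role of p(w_j | w_i)
hpca_style = column_conditional(counts)   # entries play the role of p(w_i | w_j)
print(rsv_style, hpca_style)
```

The same counts therefore lead to two different matrices, so the two methods start from different data even before the transformation and the centring step are applied.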

The connection between RSV and GloVe is as follows. GloVe extracts word vectors from a co-occurrence matrix transformed by a logarithm function. Using a global regression model, Pennington et al. (2014) argue that linear directions of meaning are captured by the matrix of word vectors $\Upsilon_{n,k}$ with the following property:

$$\Upsilon^T \Upsilon = \log(C) + b\mathbf{1} \qquad (5)$$

where $C_{n,n}$ is the co-occurrence matrix, $b_{n,1}$ is a bias vector, and $\mathbf{1}_{1,n}$ is a vector of ones. Denoting $\Upsilon_i$ as the $i$th column of $\Upsilon$ and assuming $\|\Upsilon_i\| = 1$ for $i = 1 \dots n$, the left-hand side of Eq. 5 measures the cosine similarity between unit-sized word vectors $\Upsilon_i$ in a kernel space, and the right-hand side is the corresponding kernel matrix. Using kernel principal component analysis (Schölkopf et al., 1998), a $k$-dimensional estimation of $\Upsilon$ in Eq. 5 is

$$\Upsilon = \sqrt{S}\, V^T \qquad (6)$$

where $S$ and $V$ are the matrices of the top singular values and singular vectors of $K$. Replacing the kernel matrix in Eq. 5 with the second-degree polynomial kernel $K = f(C^T) f(C)$, which measures the similarities on the basis of the column vectors defined by the co-occurrence matrix, the word vectors generated by Eq. 6 and Eq. 1 are distributed in the same directions but with different variances. This shows that the main difference between RSV and GloVe is in the kernel matrices they use.
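The role of Eq. 6 as a low-rank factorisation of a kernel matrix can be illustrated numerically. The sketch below builds a kernel matrix from a small random stand-in for f(C) (an assumption; no real co-occurrence data is used), forms the square root of the top eigenvalues times the transposed eigenvectors, and checks how well the product of that matrix with its transpose reproduces the kernel. The helper name `kernel_pca_vectors` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F = rng.random((6, 4))          # hypothetical f(C): 6 rows, 4 word columns
K = F.T @ F                     # 4 x 4 kernel matrix over the word columns

# K is symmetric positive semi-definite, so eigh applies; sort descending.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]


def kernel_pca_vectors(k):
    """Eq. 6 style estimate: sqrt(S) V^T from the top-k eigenpairs of K."""
    S = np.diag(eigvals[:k])
    V = eigvecs[:, :k]
    return np.sqrt(S) @ V.T     # k x n, one column per word


full = kernel_pca_vectors(4)
low = kernel_pca_vectors(2)
err_full = np.linalg.norm(full.T @ full - K)
err_low = np.linalg.norm(low.T @ low - K)
print(err_full, err_low)
```

With all eigenpairs the kernel is reproduced up to rounding error, and with fewer it is approximated, which is the sense in which the k-dimensional vectors estimate the left-hand side of Eq. 5.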

9 Conclusion

In this paper, we have proposed to form a set of word vectors from the right singular vectors of a co-occurrence matrix that is transformed by a 7th-root transformation function. It has been shown that the proposed method is closely related to previous methods of word embedding such as HPCA and GloVe. Our experiments on the task of dependency parsing show that the parsing models trained with our word vectors are more accurate than the parsing models trained with other popular methods of word embedding.
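As a rough recap of the method summarised here, the following sketch strings the steps together on toy data: conditional probabilities, a 7th-root transformation, centring, SVD, and scaling of the top right singular vectors. It is a sketch under stated assumptions rather than the authors' implementation: the function name, the toy counts, and the value of lambda are hypothetical, and the final scaling by lambda times the square root of n - 1 follows the substitution described in Sec. 8 and is assumed to agree with Eq. 1.

```python
import numpy as np


def rsv_like_vectors(counts, k, lam=0.1, root=7):
    """Toy pipeline: counts -> p(w_j | w_i) -> root transform -> centre -> SVD."""
    counts = np.asarray(counts, dtype=float)
    C = counts / counts.sum(axis=1, keepdims=True)   # C[i, j] estimates p(w_j | w_i)
    F = C ** (1.0 / root)                            # 7th-root transformation
    F = F - F.mean(axis=1, keepdims=True)            # centre the column vectors
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    n = F.shape[1]                                   # number of column vectors
    return lam * np.sqrt(n - 1) * Vt[:k]             # k x n, one column per word


rng = np.random.default_rng(0)
toy_counts = rng.integers(1, 20, size=(6, 5))        # hypothetical positive counts
vectors = rsv_like_vectors(toy_counts, k=2)
print(vectors.shape)
```

Under these assumptions each of the k dimensions has zero mean and a sample standard deviation equal to lambda across the words, which is the normalisation property discussed in Sec. 8.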

References

Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, and Farhad Oroumchian. 2009. Hamshahri: A standard Persian text collection. Knowledge-Based Systems, 22(5):382–387.

Taylor Berg-Kirkpatrick, David Burkett, and Dan Klein. 2012. An empirical investigation of statistical significance in NLP. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 995–1005, Stroudsburg, PA, USA. Association for Computational Linguistics.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.

Marie-Catherine De Marneffe and Christopher D Manning. 2010. Stanford typed dependencies manual (2008). URL: http://nlp.stanford.edu/software/dependencies_manual.pdf.

Marie-Catherine De Marneffe, Bill MacCartney, Christopher D Manning, et al. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC, volume 6, pages 449–454.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In 16th Nordic Conference of Computational Linguistics, pages 105–112. University of Tartu.

Rémi Lebret and Ronan Collobert. 2014. Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, Gothenburg, Sweden, April. Association for Computational Linguistics.

Rémi Lebret and Ronan Collobert. 2015. Rehabilitation of count-based models for word vector representations. In Computational Linguistics and Intelligent Text Processing, pages 417–429. Springer.

Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. 2012. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics - Special issue on using large corpora, 19(2):313–330, June.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 2216–2219.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299–1319.

Mojgan Seraji. 2015. Morphosyntactic Corpora and Tools for Persian. Ph.D. thesis, Uppsala University.

Milan Straka, Jan Hajic, Jana Straková, and Jan Hajic jr. 2015. Parsing universal dependency treebanks using neural networks and search-based oracle. In International Workshop on Treebanks and Linguistic Theories (TLT14), pages 208–220.

Milan Straka, Jan Hajic, and Jana Straková. 2016. UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

A Tropp, N Halko, and PG Martinsson. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical report, Applied & Computational Mathematics, California Institute of Technology.
