A Two-Step Named Entity Recognizer for Open-Domain Search Queries

October 2013

Conference: International Joint Conference on Natural Language Processing

Authors:

Universidad Tecnica Metropolitana

International Joint Conference on Natural Language Processing, pages 829–833, Nagoya, Japan, 14-18 October 2013.

A Two-Step Named Entity Recognizer for Open-Domain Search Queries

Andreas Eiselt

Yahoo! Research Latin America

Av. Blanco Encalada 2120,

Santiago, Chile

eiselt@yahoo-inc.com

Alejandro Figueroa

Yahoo! Research Latin America

Av. Blanco Encalada 2120,

Santiago, Chile

afiguero@yahoo-inc.com

Abstract

Named entity recognition in queries is the task of identifying sequences of terms in search queries that refer to a unique concept. This problem is attracting increasing attention, since the lack of context in short queries makes the task difficult for off-the-shelf named entity recognizers designed for full text. In this paper, we propose to deal with this problem in a two-step fashion. The first step classifies each query term as either a token or part of a named entity. The second step takes advantage of these binary labels for categorizing query terms into a pre-defined set of 28 named entity classes. Our results show that our two-step strategy is promising, outperforming a traditional one-step baseline by more than 10%.

1 Introduction

Search engines are key players in serving as an interface between users and web resources. Hence, they have started to take on the challenge of modelling user interests and enhancing the search experience. This is one of the main drivers for replacing classical document-keyword matching, a.k.a. the bag-of-words approach, with user-oriented strategies. Specifically, these changes are geared towards improving the precision, contextualization, and personalization of the search results. To achieve this, it is vital to identify fundamental structures such as named entities (e.g., persons, locations and organizations) (Hu et al., 2009). Indeed, previous studies indicate that over 70% of all queries contain entities (Guo et al., 2009; Yin and Shah, 2010).

Search queries are composed of 2-3 words on average, yielding little context and breaking the grammatical rules of natural language (Guo et al., 2009; Du et al., 2010). Thus, named entity recognizers built for relatively lengthy, grammatically well-formed documents perform poorly on the task of Named Entity Recognition in Queries (NERQ).

At heart, the contribution of this work is a novel supervised approach to NERQ, trained with a large set of manually tagged queries and consisting of two steps: 1) a binary classification, where each query term is tagged as token/entity depending on whether or not it is part of a named entity; and 2) a categorization step that takes advantage of these binary token/entity labels for assigning each term within the query to one of a pre-defined set of classes.

2 Related Work

To the best of our knowledge, there have been only a few previous research efforts attempting to recognize named entities in search queries. This problem is relatively new, and it was first introduced by (Paşca, 2007). This weakly supervised method starts with an input class represented by a set of seeds, which are used to induce typical query contexts for the respective input category. Contexts are then used to acquire and select new candidate instances for the corresponding class.

In their pioneering work, (Guo et al., 2009) focused on queries that contain only one named entity belonging to one of four classes (i.e., movie, game, book and song). As a learning approach, they employed weakly supervised topic models using partially labeled seed named entities. These topic models were trained using query log data corresponding to 120 seed named entities (another 60 for testing) selected from three target web sites. Later, (Jain and Pennacchiotti, 2010) extended this approach to a completely unsupervised and class-independent method.

In another study, (Du et al., 2010) tackled the lack of context in short queries by interpreting query sequences in the same search session as extra contextual information. They capitalized on a collection of 6,000 sessions containing only queries targeted at the car model domain. They trained Conditional Random Field (CRF) and topic models, showing that using search sessions improves the performance significantly. More recently, (Alasiry et al., 2012a; Alasiry et al., 2012b) determined named entity boundaries by combining grammar annotation, query segmentation, and top-ranked snippets from search engine results in conjunction with a web n-gram model.

In contrast, we do not profit from seed named entities or web search results, but rather from a large manually annotated collection of about 80,000 open-domain queries. We consider search queries containing multiple named entities, and we do not make use of search sessions. Furthermore, our approach performs two labelling steps instead of a straightforward one-step labelling. The first step checks whether or not each query term is part of a named entity, while the second assigns each term to one out of a set of 29¹ classes by taking into account the outcome of the first step.

3 NERQ-2S

NERQ-2S is a two-step named entity recognizer for open-domain search queries. First, it differentiates named entity terms from other types of tokens (e.g., words and numbers) on the basis of a CRF² trained with manually annotated data. In the second step, NERQ-2S incorporates the output of this CRF into a new CRF as a feature. This second CRF assigns each term within the query to one out of 29 pre-defined categories. In essence, these automatically computed binary entity/token labels are meant to influence the second model so that the overall performance improves. Since the binary entity/token tags are only used as additional contextual evidence by the second CRF, these labels can be overridden in the second step. NERQ-2S identifies 28 named entity classes that are prominent in open-domain search engine queries (see table 1). This set of categories was deliberately chosen as a means of enriching search results with regard to general user interests, and is thus aimed at providing a substantially better overall user experience. In particular, named entities are normally utilized for devising the layout and the content of the result page of a search engine.

¹ In actuality, we considered 29 classes: 28 for named entities and one for non-entities (token). For the sake of readability, from now on we say interchangeably that the second step identifies 28 named entity classes or 29 classes.

² CRFsuite: http://www.chokkan.org/software/crfsuite

At both steps, NERQ-2S uses a CRF as classifier together with a set of properties, which was determined separately for each classifier by executing a greedy feature selection algorithm (see next section). For both CRFs, this algorithm considered as candidates the 24 attributes explained in table 2. Additionally, in the case of the second CRF, this algorithm took into account the entity/token feature produced by the first CRF. Note that the features in table 2 are well known from other named entity recognition systems (Nadeau and Sekine, 2007).
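The two-step design described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two taggers are passed in as plain callables (hypothetical stand-ins for the two trained CRFs), the feature name `first_step` is introduced here for illustration, and only three word features are built.

```python
from typing import Callable, Dict, List

Features = Dict[str, str]
Tagger = Callable[[List[Features]], List[str]]


def word_features(terms: List[str]) -> List[Features]:
    """Build a tiny feature dict per term (current, previous and next term)."""
    feats = []
    for i, term in enumerate(terms):
        feats.append({
            "cur": term,
            "prev": terms[i - 1] if i > 0 else "<s>",
            "next": terms[i + 1] if i + 1 < len(terms) else "</s>",
        })
    return feats


def two_step_tag(terms: List[str], step1: Tagger, step2: Tagger) -> List[str]:
    """Run the binary tagger, then pass its output to the class tagger as a feature."""
    feats = word_features(terms)
    binary = step1(feats)  # "entity" or "token" for each term
    enriched = [dict(f, first_step=b) for f, b in zip(feats, binary)]
    return step2(enriched)  # one class label per term


# Toy stand-ins for the two trained CRFs (hypothetical rules, illustration only).
def toy_step1(feats: List[Features]) -> List[str]:
    return ["entity" if f["cur"] in {"berlin", "yahoo"} else "token" for f in feats]


def toy_step2(feats: List[Features]) -> List[str]:
    labels = []
    for f in feats:
        if f["first_step"] == "entity" and f["cur"] == "berlin":
            labels.append("Place Name")
        elif f["first_step"] == "entity":
            labels.append("Organization Name")
        else:
            labels.append("Token")
    return labels


labels = two_step_tag(["hotels", "in", "berlin"], toy_step1, toy_step2)
```

Because the binary label enters the second tagger only as one more feature, a real second-step model remains free to assign a class that contradicts it, which matches the remark that first-step labels can be overridden.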

4 Experiments

In all our experiments, we carried out a 10-fold cross-validation. As for data, we used a collection comprising 82,413 queries, which are composed of 242,723 terms³. These queries were randomly extracted from the query log of a commercial search engine, and they are exclusively in English. In order to annotate our query collection, these queries were first tokenized, and then each term was manually tagged by an editorial team using the schema adopted in (Tjong Kim Sang and De Meulder, 2003).
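As a rough sketch of the evaluation protocol, the snippet below shows one common way to form the folds of a 10-fold cross-validation. The text does not say how the folds were built, so the random shuffling at query level, the seed, and the function name `k_fold_splits` are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_items: int, k: int = 10, seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with 25 hypothetical queries; each query is tested exactly once.
splits = list(k_fold_splits(n_items=25, k=10))
```

Each query index lands in exactly one test fold, so every annotated query contributes to the reported score once.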

Attributes were selected by means of a greedy algorithm. This procedure starts with an empty bag of properties and, after each iteration, adds the one that performs the best. In order to determine this feature, the procedure tests each non-selected attribute together with all the properties already in the bag. The algorithm stops when no non-selected feature enhances the performance.
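The greedy procedure just described can be sketched as a forward selection loop. This is an illustrative reading of the paragraph, not the authors' code: `evaluate` is a hypothetical callback standing in for training a CRF on a feature bag and returning its cross-validated F(1), and the toy scoring function and feature names are invented for the example.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def greedy_select(candidates: Iterable[str],
                  evaluate: Callable[[FrozenSet[str]], float]) -> Tuple[List[str], float]:
    """Forward selection: add the single best feature per iteration, stop when none helps."""
    remaining = list(candidates)
    selected: List[str] = []
    best_score = float("-inf")
    while remaining:
        # Test every non-selected feature together with the current bag.
        scored = [(evaluate(frozenset(selected + [f])), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:  # no candidate enhances the performance
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Hypothetical scoring function: each feature has a fixed gain,
# and "c" becomes redundant once "a" is in the bag.
GAINS = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.0}


def toy_score(bag: FrozenSet[str]) -> float:
    total = sum(GAINS[f] for f in bag)
    if "a" in bag and "c" in bag:
        total -= GAINS["c"]
    return total


chosen, score = greedy_select(GAINS, toy_score)
```

In this toy run the loop picks "a", then "b", and then stops because neither "c" nor "d" raises the score any further. The order of `chosen` corresponds to the left-to-right order of the bars in a selection plot.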

[Figure 1: bar chart. x-axis: Feature ID, in selection order: 11, 10, 2, 1, 0, 4, 13, 3, 7, 17, 19, 21, 12. y-axis: F(1)-Score, 0.50 to 0.70. Bar labels recovered from the plot: 0.519, 0.561, 0.599, 0.621, 0.634, 0.642, 0.648, 0.652, 0.656, 0.657, 0.658, 0.659.]

Figure 1: Attributes selected by the greedy algorithm and their respective contribution (baseline). See also table 2 for id-feature mappings.

³ Due to privacy laws, query logs cannot be made public.

ID  Name                   Example
0   Airline Code           AA, LA, JJ
1   Beverage               Cocktails, Beer
2   Brand Name             Bacardi, Apple
3   Business               Hotel, Newspaper
4   Cooking Method         Pressure Cooking
5   Cuisine                Mexican, German
6   Currency Name          Dollar, Euros, Pesos
7   Diet                   Vegan, Fat free
8   Disease and Condition  Cancer, Diabetic
9   Dish                   Ratatouille, Tiramisu
10  Domain                 forbes.com, lan.com
11  Drink                  Bloody Mary, Sangria
12  Email Address          john.doe@example.com
13  Event Name             Christmas, Super Bowl
14  File Name              msimn.exe, .htaccess
15  Food                   Sushi, Bread, Dessert
16  Food Ingredient        Honey, Avocado
17  Food Taste             Sweet, Cheesy
18  Horoscope Sign         Libra, Taurus
19  Measurement Name       Inches, Kilogram
20  Media Title            Age of Empires 2
21  Occasion               Festival, Ceremony
22  Organization Name      Yahoo, Café Soleil
23  Person Name            Mary Poppins
24  Phone Number           3153423595
25  Place Name             Chile, Berlin
26  Product                Camera, Cell phone
27  Treatment              Steroids, Surgery
28  Token (no NE-class)    how, to, image

Table 1: Named entity classes recognized by NERQ-2S.

As a baseline, we used a traditional one-step approach grounded on a CRF enriched with 13 out of our 24 features (see table 2), which were chosen by running our greedy feature selection algorithm. Figure 1 shows the order in which these 13 features were chosen, and their respective impact on the performance. Regarding these results, it is worth highlighting the following findings:

1. The first feature selected by the greedy algorithm models each term by its non-numerical characters (id=11 in table 2). This attribute helps to correctly tag 80.42% of the terms when they are modified (numbers removed).

2. The third chosen feature considers the value of the following word when tagging a term (id=2 in table 2). This attribute helps to correctly annotate 79.68%, 74.55% and 74.87% of tokens belonging to person, place and organization names, respectively.

3. Our figures also point to the relevance of the three word features (id=0,1,2 in table 2). These features were selected in a row, boosting the performance from F(1) = 0.561 to F(1) = 0.634, a 13.01% increase with respect to the previously selected properties.

In summary, the performance of the one-step baseline is F(1) = 0.659. In contrast, figure 2 highlights the 16 out of the 25 features utilized by the second phase of NERQ-2S. Note that the “new” bar indicates the token/entity attribute determined in the first step. Most importantly, NERQ-2S finished with an F(1) = 0.729, which means a 10.62% enhancement with respect to the one-step baseline. From these results, it is worth considering the following aspects:

1. In terms of features, 11 of the 13 attributes used by the one-step baseline were also exploited by NERQ-2S. Further, NERQ-2S profits from four additional properties that were also available to the one-step baseline.

2. The five most prominent properties selected by the baseline were also chosen by NERQ-2S, with just a slight change in order.

3. The “new” feature achieves an improvement of 23.51% (F(1) = 0.641) with respect to the previously selected property. The impact of the entity/token attribute can be measured by comparing this figure with the performance accomplished by the first five features selected by the baseline (F(1) = 0.634).

In light of these results, we can conclude that: a) adding the entity/token feature to the CRF is vital for boosting the performance, making a two-step approach a better solution than the traditional one-step approach; and b) this entity/token property is complementary to the list shown in table 2.

The confusion matrix for NERQ-2S shows that errors mostly involve highly ambiguous terms. Some interesting misclassifications:

1. Overall, 17.38% of the terms belonging to place names were mistagged by NERQ-2S. Of these, 72.11% were perceived as part of organization names.

2. On the other hand, 17.27% of the terms corresponding to organization names were mislabelled by NERQ-2S. Here, 15.52% and 12.84% of these errors were due to the fact that these terms were seen as tokens and parts of place names, respectively.

ID  Feature                                      Example

Word Features
0   Current term (t_i)                           abc123
1   Previous term (t_{i-1})                      before
2   Next word (t_{i+1})                          after

N-grams
3   Bi-gram of t_{i-1} and t_i                   before abc123
4   Bi-gram of t_i and t_{i+1}                   abc123 after

Pre- & Postfix
5   1 leftmost character from t_i                a
6   2 leftmost characters from t_i               ab
7   3 leftmost characters from t_i               abc
8   1 rightmost character from t_i               3
9   2 rightmost characters from t_i              23
10  3 rightmost characters from t_i              123

Reductions
11  t_i without digits                           abc
12  t_i without letters                          123

Word Shape
13  Shape of t_i (“a” represents letters,        aaa000
    “0” digits, “-” special characters)
14  Shape of t_i (same elements joined)          a0

Position & Lengths
15  Position of t_i from left                    3
16  Position of t_i from right                   2
17  Character length of t_i                      6

Boolean
18  t_i is a number? (only digits)               false
19  t_i is a word? (only letters)                false
20  t_i is a mixture of letters and digits?      true
21  t_i contains “.”?                            false
22  t_i contains apostrophe?                     false
23  t_i contains other special characters?       false

Table 2: List of used features. Examples are for the third term of the query “first before abc123 after”.
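To make the attribute definitions in table 2 concrete, the sketch below computes all 24 of them for one term of a tokenized query. It is an interpretation rather than the authors' extractor: the function name `term_features`, the empty string used for a missing neighbour, the reading of feature 20 as "at least one letter and at least one digit", and the reading of feature 23 as "any character that is neither alphanumeric nor a dot or apostrophe" are assumptions.

```python
from itertools import groupby
from typing import Dict, List, Union

Value = Union[str, int, bool]


def term_features(terms: List[str], i: int) -> Dict[int, Value]:
    """Compute the 24 attributes of table 2 for the term at 0-based index i."""
    t = terms[i]
    prev = terms[i - 1] if i > 0 else ""              # assumption: "" at the query start
    nxt = terms[i + 1] if i + 1 < len(terms) else ""  # assumption: "" at the query end
    # Word shape: letters -> "a", digits -> "0", anything else -> "-".
    shape = "".join("a" if c.isalpha() else "0" if c.isdigit() else "-" for c in t)
    return {
        0: t, 1: prev, 2: nxt,
        3: f"{prev} {t}", 4: f"{t} {nxt}",
        5: t[:1], 6: t[:2], 7: t[:3],
        8: t[-1:], 9: t[-2:], 10: t[-3:],
        11: "".join(c for c in t if not c.isdigit()),   # term without digits
        12: "".join(c for c in t if not c.isalpha()),   # term without letters
        13: shape,
        14: "".join(key for key, _ in groupby(shape)),  # runs of equal symbols joined
        15: i + 1,           # 1-based position from the left
        16: len(terms) - i,  # 1-based position from the right
        17: len(t),
        18: t.isdigit(),
        19: t.isalpha(),
        20: any(c.isalpha() for c in t) and any(c.isdigit() for c in t),
        21: "." in t,
        22: "'" in t,
        23: any(not c.isalnum() and c not in ".'" for c in t),
    }


example = term_features("first before abc123 after".split(), 2)
```

For the third term of the example query, `example` reproduces the Example column of table 2, for instance `example[13] == "aaa000"` and `example[16] == 2`.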

[Figure 2: bar chart. x-axis: Feature ID, in selection order: 11, new, 10, 2, 0, 1, 7, 4, 13, 3, 21, 14, 8, 16, 19, 18. y-axis: F(1)-Score, 0.50 to 0.80. Bar labels recovered from the plot: 0.519, 0.641, 0.682, 0.698, 0.708, 0.716, 0.720, 0.723, 0.727, 0.728, 0.729.]

Figure 2: Attributes selected by the greedy algorithm and their respective contribution (NERQ-2S). See also table 2 for id-feature mappings. The word “new” denotes the binary token/entity attribute determined in the first step.

Incidentally, NERQ-2S mislabelled 10.40% of the tokens (non-named entity terms), while the one-step baseline mislabelled 17.57%. This difference signals the importance of a first step consisting of a specialized and efficient token/entity term annotator. With regard to the first step of NERQ-2S, nine out of the 24 properties were useful, and the first step finished with an F(1) = 0.8077. Of these nine attributes, eight correspond to the top eight features used by our one-step baseline, and one is an extra attribute (id=20). Thus, the discriminative probabilistic model learned in this first step is more specialized for this task. That is to say, though the context of a term might be modelled similarly, the parameters of the CRF model are different.

The confusion matrix for this binary classifier shows that 11.44% of entity terms were mistagged as tokens, while 22.24% of tokens were mistagged as entity terms. This means that a higher percentage of errors comes from mislabelled tokens.
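Per-class error rates of this kind can be read off a list of (gold, predicted) label pairs, as in the following sketch. The helper name `per_class_error_rate` and the toy pairs are hypothetical; they are not the paper's data and do not reproduce its percentages.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_class_error_rate(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Fraction of terms of each gold class that received a different predicted label."""
    totals: Counter = Counter()
    errors: Counter = Counter()
    for gold, predicted in pairs:
        totals[gold] += 1
        if predicted != gold:
            errors[gold] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Hypothetical (gold, predicted) pairs, for illustration only.
toy_pairs = (
    [("entity", "entity")] * 3 + [("entity", "token")] * 1
    + [("token", "token")] * 2 + [("token", "entity")] * 2
)
rates = per_class_error_rate(toy_pairs)
```

Each rate is normalized by the number of gold terms of that class, which is how statements such as "11.44% of entity terms were mistagged as tokens" are phrased.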

On a final note, as a means of quantifying the impact of the first step on NERQ-2S, we replaced the output given by the first CRF model with the manual binary token/entity annotations given by the editorial team. In other words, the “new” feature is now a manual input instead of an automatically computed property. By doing this, NERQ-2S increases the performance from F(1) = 0.729 to F(1) = 0.809, which is 10.97% better than NERQ-2S and 22.76% better than the one-step baseline. This corroborates that a two-step approach to NERQ is promising.
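The percentages quoted in this section are relative improvements over the lower of two F(1) scores. The short check below recomputes them from the scores reported in the text; the variable name `oracle` is a label chosen here for the run that uses the manual first-step annotations.

```python
def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# F(1) scores reported in the text.
baseline, nerq2s, oracle = 0.659, 0.729, 0.809

gain_nerq2s_vs_baseline = round(relative_gain(baseline, nerq2s), 2)
gain_oracle_vs_nerq2s = round(relative_gain(nerq2s, oracle), 2)
gain_oracle_vs_baseline = round(relative_gain(baseline, oracle), 2)
```

The three values come out at about 10.62, 10.97 and 22.76, which agrees with the figures given above.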

5 Conclusions and Further Work

This paper presents NERQ-2S, a two-step approach to the problem of recognizing named entities in search queries. In the first stage, NERQ-2S checks whether or not each query term belongs to a named entity, and in the second phase, it categorizes each term according to a set of pre-defined classes. In contrast to previously used pre-defined categories, these classes are aimed at enhancing the user experience with the search engine.

Our results indicate that our two-step approach outperforms the typical one-step NERQ. Since our error analysis indicates that there is about 11% of potential global improvement to be gained by boosting the performance of the entity/token tagger, one research direction is to combine the output of distinct two-sided classifiers for improving the overall performance of NERQ-2S.

References

Areej Alasiry, Mark Levene, and Alexandra Poulovassilis. 2012a. Detecting candidate named entities in search queries. In SIGIR, pages 1049–1050.

Areej Alasiry, Mark Levene, and Alexandra Poulovassilis. 2012b. Extraction and evaluation of candidate named entities in search engine queries. In WISE, pages 483–496.

Junwu Du, Zhimin Zhang, Jun Yan, Yan Cui, and Zheng Chen. 2010. Using search session context for named entity recognition in query. In Proceeding of the 33rd international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’10.

Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’09, page 267, New York, New York, USA. ACM Press.

Jian Hu, Gang Wang, Fred Lochovsky, Jian-Tao Sun, and Zheng Chen. 2009. Understanding users query intent with Wikipedia. In Proceedings of WWW-09.

A. Jain and Marco Pennacchiotti. 2010. Open entity extraction from web search query logs. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 510–518.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3–26, January. Publisher: John Benjamins Publishing Company.

Marius Paşca. 2007. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM ’07, page 683, New York, New York, USA. ACM Press.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 142–147, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xiaoxin Yin and Sarthak Shah. 2010. Building Taxonomy of Web Search Intents for Name Entity Queries. In Proceedings of WWW-2010.

SMAPH: A Piggyback Approach for Entity-Linking in Web Queries

Article

Dec 2018

We study the problem of linking the terms of a web-search query to a semantic representation given by the set of entities (a.k.a. concepts) mentioned in it. We introduce SMAPH, a system that performs this task using the information coming from a web search engine, an approach we call “piggybacking.” We employ search engines to alleviate the noise and irregularities that characterize the language of queries. Snippets returned as search results also provide a context for the query that makes it easier to disambiguate the meaning of the query. From the search results, SMAPH builds a set of candidate entities with high coverage. This set is filtered by linking back the candidate entities to the terms occurring in the input query, ensuring high precision. A greedy disambiguation algorithm performs this filtering; it maximizes the coherence of the solution by iteratively discovering the pertinent entities mentioned in the query. We propose three versions of SMAPH that outperform state-of-the-art solutions on the known benchmarks and on the GERDAQ dataset, a novel dataset that we have built specifically for this problem via crowd-sourcing and that we make publicly available.

A Piggyback System for Joint Entity Mention Detection and Linking in Web Queries

Conference Paper

Apr 2016

In this paper we study the problem of linking open-domain web-search queries towards entities drawn from the full entity inventory of Wikipedia articles. We introduce SMAPH-2, a second-order approach that, by piggybacking on a web search engine, alleviates the noise and irregularities that characterize the language of queries and puts queries in a larger context in which it is easier to make sense of them. The key algorithmic idea underlying SMAPH-2 is to first discover a candidate set of entities and then link-back those entities to their mentions occurring in the input query. This allows us to confine the possible concepts pertinent to the query to only the ones really mentioned in it. The link-back is implemented via a collective disambiguation step based upon a supervised ranking model that makes one joint prediction for the annotation of the complete query optimizing directly the F1 measure. We evaluate both known features, such as word embeddings and semantic relatedness among entities, and several novel features such as an approximate distance between mentions and entities (which can handle spelling errors). We demonstrate that SMAPH-2 achieves state-of-the-art performance on the ERD@SIGIR2014 benchmark. We also publish GERDAQ (General Entity Recognition, Disambiguation and Annotation in Queries), a novel, public dataset built specifically for web-query entity linking via a crowdsourcing effort. SMAPH-2 outperforms the benchmarks by comparable margins also on GERDAQ.

An Instance Transfer-Based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition

Article

Full-text available

Feb 2020

Recently, neural networks have shown promising results for named entity recognition(NER), which needs a number of labeled data to for model training. When meeting a new domain (target domain) for NER, there is no or a few labeled data, which makes domain NER much more difficult. As NER has been researched for a long time, some similar domain already has well labeled data(source domain). Therefore, in this paper, we focus on domain NER by studying how to utilize the labeled data from such similar source domain for the new target domain. We design a kernel function based instance transfer strategy by getting similar labeled sentences from a source domain. Moreover, we propose an enhanced recurrent neural network (ERNN) by adding an additional layer that combines the source domain labeled data into traditional RNN structure. Comprehensive experiments are conducted on two datasets. The comparison results among HMM, CRF and RNN show that RNN performs better than others. When there is no labeled data in domain target, compared to directly using the source domain labeled data without selecting transferred instances, our enhanced RNN approach gets improvement from 0.8052 to 0.9328 in terms of F1 measure.

CloseUp—A Community-Driven Live Online Search Engine

Article

Aug 2019

Search engines are still the most common way of finding information on the Web. However, they are largely unable to provide satisfactory answers to time- and location-specific queries. Such queries can best and often only be answered by humans that are currently on-site. Although online platforms for community question answering are very popular, very few exceptions consider the notion of users’ current physical locations. In this article, we present CloseUp, our prototype for the seamless integration of community-driven live search into a Google-like search experience. Our efforts focus on overcoming the defining differences between traditional Web search and community question answering, namely the formulation of search requests (keyword-based queries vs. well-formed questions) and the expected response times (milliseconds vs. minutes/hours). To this end, the system features a deep learning pipeline to analyze submitted queries and translate relevant queries into questions. Searching users can submit suggested questions to a community of mobile users. CloseUp provides a stand-alone mobile application for submitting, browsing, and replying to questions. Replies from mobile users are presented as live results in the search interface. Using a field study, we evaluated the feasibility and practicability of our approach.

Named Entity Recognition in Local Intent Web Search Queries

Chapter

Full-text available

Aug 2019

Semantic understanding of web queries is a challenging problem as web queries are short, noisy and usually do not observe the grammar of a written language. In this paper, we specifically study the user web search queries with local intent on Bing. Local intent queries deal with searching for local businesses and services in a location. Hence, local query parsing translates into the classical problem of Named Entity Recognition (NER) in NLP. State-of-the-art NER systems rely heavily on hand-crafted features and domain-specific knowledge to effectively learn from the small, supervised training corpora that is available. In this paper, we use deep learnt neural model that relies solely on features extracted from word embeddings learnt in an unsupervised way, using search logs. We propose a novel technique for generating domain specific embeddings and show that they significantly improve the performance of existing models for the NER task. Our model outperforms the existing CRF based parser currently used in production.

Neural Transition Based Parsing of Web Queries: An Entity Based Approach

Conference Paper

Jan 2018

A Methodology for the Resolution of Cashtag Collisions on Twitter – A Natural Language Processing & Data Fusion Approach

Article

Mar 2019
EXPERT SYST APPL

Investors utilise social media such as Twitter as a means of sharing news surrounding financials stocks listed on international stock exchanges. Company ticker symbols are used to uniquely identify companies listed on stock exchanges and can be embedded within tweets to create clickable hyperlinks referred to as cashtags, allowing investors to associate their tweets with specific companies. The main limitation is that identical ticker symbols are present on exchanges all over the world, and when searching for such cashtags on Twitter, a stream of tweets is returned which match any company in which the cashtag refers to - we refer to this as a cashtag collision. The presence of colliding cashtags could sow confusion for investors seeking news regarding a specific company. A resolution to this issue would benefit investors who rely on the speediness of tweets for financial information, saving them precious time. We propose a methodology to resolve this problem which combines Natural Language Processing and Data Fusion to construct company-specific corpora to aid in the detection and resolution of colliding cashtags, so that tweets can be classified as being related to a specific stock exchange or not. Supervised machine learning classifiers are trained twice on each tweet –once on a count vectorisation of the tweet text, and again with the assistance of features contained in the company-specific corpora. We validate the cashtag collision methodology by carrying out an experiment involving companies listed on the London Stock Exchange. Results show that several machine learning classifiers benefit from the use of the custom corpora, yielding higher classification accuracy in the prediction and resolution of colliding cashtags.

Big Data Fusion Model for Heterogeneous Financial Market Data (FinDf): Proceedings of the 2018 Intelligent Systems Conference (IntelliSys) Volume 1

Chapter

Jan 2019

The dawn of big data has seen the volume, variety, and velocity of data sources increase dramatically. Enormous amounts of structured, semi-structured and unstructured heterogeneous data can be garnered at a rapid rate, making analysis of such big data a herculean task. This has never been truer for data relating to financial stock markets, the biggest challenge being the 7Vs of big data which relate to the collection, pre-processing, storage and real-time processing of such huge quantities of disparate data sources. Data fusion techniques have been adopted in a wide number of fields to cope with such vast amounts of heterogeneous data from multiple sources and fuse them together in order to produce a more comprehensive view of the data and its underlying relationships. Research into the fusing of heterogeneous financial data is scant within the literature, with existing work only taking into consideration the fusing of text-based financial documents. The lack of integration between financial stock market data, social media comments, financial discussion board posts and broker agencies means that the benefits of data fusion are not being realised to their full potential. This paper proposes a novel data fusion model, inspired by the data fusion model introduced by the Joint Directors of Laboratories, for the fusing of disparate data sources relating to financial stocks. Data with a diverse set of features from different data sources will supplement each other in order to obtain a Smart Data Layer, which will assist in scenarios such as irregularity detection and prediction of stock prices. KeywordsBig dataData fusionHeterogeneous financial data

An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition

Preprint

Oct 2018

Recently, neural networks have shown promising results for named entity recognition (NER), which needs a number of labeled data to for model training. When meeting a new domain (target domain) for NER, there is no or a few labeled data, which makes domain NER much more difficult. As NER has been researched for a long time, some similar domain already has well labelled data (source domain). Therefore, in this paper, we focus on domain NER by studying how to utilize the labelled data from such similar source domain for the new target domain. We design a kernel function based instance transfer strategy by getting similar labelled sentences from a source domain. Moreover, we propose an enhanced recurrent neural network (ERNN) by adding an additional layer that combines the source domain labelled data into traditional RNN structure. Comprehensive experiments are conducted on two datasets. The comparison results among HMM, CRF and RNN show that RNN performs bette than others. When there is no labelled data in domain target, compared to directly using the source domain labelled data without selecting transferred instances, our enhanced RNN approach gets improvement from 0.8052 to 0.9328 in terms of F1 measure.

Dual-View Learning for Detecting Web Query Intents

Article

Aug 2019

Automatically categorizing user intent behind web queries is a key issue not only for improving information retrieval tasks but also for designing tailored displays based on the underlying intention. In this article, a multiview learning method is proposed to recognize the user intent behind web searches.

Building taxonomy of web search intents for name entity queries

Conference Paper

Apr 2010

A significant portion of web search queries are name entity queries. The major search engines have been exploring various ways to provide better user experiences for name entity queries, such as showing "search tasks" (Bing search) and showing direct answers (Yahoo!, Kosmix). In order to provide the search tasks or direct answers that can satisfy most popular user intents, we need to capture these intents, together with relationships between them. In this paper we propose an approach for building a hierarchical taxonomy of the generic search intents for a class of name entities (e.g., musicians or cities). The proposed approach can find phrases representing generic intents from user queries, and organize these phrases into a tree, so that phrases indicating equivalent or similar meanings are on the same node, and the parent-child relationships of tree nodes represent the relationships between search intents and their sub-intents. Three different methods are proposed for tree building, which are based on directed maximum spanning tree, hierarchical agglomerative clustering, and pachinko allocation model. Our approaches are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia. With the evaluation by human judges (via Mechanical Turk), it is shown that our approaches can build trees of phrases that capture the relationships between important search intents.

Understanding user's query intent with wikipedia

Conference Paper

Apr 2009

Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant content, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult and often requires much human effort to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method can achieve much better coverage when classifying queries in an intent domain even though the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms the other methods in each intent domain.

A Survey of Named Entity Recognition and Classification

Article

Aug 2007

The term Named Entity, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called Named Entity Recognition and Classification (NERC).

Extraction and Evaluation of Candidate Named Entities in Search Engine Queries

Conference Paper

Nov 2012

Named Entity Recognition (NER) has recently been applied to search queries, in order to better understand their semantics. We present a novel method for detecting candidate named entities (NEs) using grammar annotation and query segmentation with the aid of top-n snippets from search engine results, and a web n-gram model to accurately identify NE boundaries. We then evaluate this method automatically using DBpedia as a rich data source of NEs, with the aid of a small representative random sample that is manually annotated. Finally, an analysis of the types of named entities that often occur in a query log is conducted, from which a search query driven named entity taxonomy is presented.

Detecting candidate named entities in search queries

Article

Aug 2012

The information extraction task of Named Entities Recognition (NER) has been recently applied to search engine queries, in order to better understand their semantics. Here we concentrate on the task prior to the classification of the named entities (NEs) into a set of categories, which is the problem of detecting candidate NEs via the subtask of query segmentation. We present a novel method for detecting candidate NEs using grammar annotation and query segmentation with the aid of top-n snippets from search engine results and a web n-gram model, to accurately identify NE boundaries. The proposed method addresses the problem of accurately setting boundaries of NEs and the detection of multiple NEs in queries.

Weakly-supervised discovery of named entities using web search queries

Conference Paper

Nov 2007

Marius Pasca

A seed-based framework for textual information extraction allows for weakly supervised extraction of named entities from anonymized Web search queries. The extraction is guided by a small set of seed named entities, without any need for handcrafted extraction patterns or domain-specific knowledge, allowing for the acquisition of named entities pertaining to various classes of interest to Web search users. Inherently noisy search queries are shown to be a highly valuable, albeit little explored, resource for Web-based named entity discovery.

Using search session context for named entity recognition in query

Conference Paper

Jul 2010

Recently, the problem of Named Entity Recognition in Query (NERQ) has been attracting increasing attention in the field of information retrieval. However, the lack of context information in short queries makes some classical named entity recognition (NER) algorithms fail. In this paper, we propose to utilize the search session information preceding a query as its context to address this limitation. We propose to improve two classical NER solutions by utilizing the search session context: the Conditional Random Field (CRF) based solution and the Topic Model based solution. In both approaches, the relationships between the current query and previous queries in the same session are used to extract novel context-aware features. Experimental results on real user search session data show that the NERQ algorithms using search session context perform significantly better than the algorithms using only the information in the short queries.

Named Entity Recognition in Query

Conference Paper

Jul 2009

This paper addresses the problem of Named Entity Recognition in Query (NERQ), which involves detection of the named entity in a given query and classification of the named entity into predefined classes. NERQ is potentially useful in many applications in web search. The paper proposes taking a probabilistic approach to the task using query log data and Latent Dirichlet Allocation. We consider contexts of a named entity (i.e., the remainders of queries after the named entity is removed) as words of a document, and classes of the named entity as topics. The topic model is constructed by a novel and general learning method referred to as WS-LDA (Weakly Supervised Latent Dirichlet Allocation), which employs weakly supervised learning (rather than unsupervised learning) using partially labeled seed entities. Experimental results show that the proposed method based on WS-LDA can accurately perform NERQ, and outperform the baseline methods.

Open Entity Extraction from Web Search Query Logs.

Conference Paper

Jan 2010

In this paper we propose a completely unsupervised method for open-domain entity extraction and clustering over query logs. The underlying hypothesis is that classes defined by mining search user activity may significantly differ from those typically considered over web documents, in that they better model the user space, i.e. users' perception and interests. We show that our method outperforms state of the art (semi-)supervised systems based either on web documents or on query logs (16% gain on the clustering task). We also report evidence that our method successfully supports a real world application, namely keyword generation for sponsored search.

Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

Article

Jul 2003

We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
