International Joint Conference on Natural Language Processing, pages 829–833,
Nagoya, Japan, 14-18 October 2013.
A Two-Step Named Entity Recognizer for Open-Domain Search Queries
Andreas Eiselt
Yahoo! Research Latin America
Av. Blanco Encalada 2120,
Santiago, Chile
eiselt@yahoo-inc.com
Alejandro Figueroa
Yahoo! Research Latin America
Av. Blanco Encalada 2120,
Santiago, Chile
afiguero@yahoo-inc.com
Abstract
Named entity recognition in queries is the task of identifying sequences of terms in search queries that refer to a unique concept. This problem is attracting increasing attention, since the lack of context in short queries makes this task difficult for full-text off-the-shelf named entity recognizers. In this paper, we propose to deal with this problem in a two-step fashion. The first step classifies each query term as token or part of a named entity. The second step takes advantage of these binary labels for categorizing query terms into a pre-defined set of 28 named entity classes. Our results show that our two-step strategy is promising, outperforming a traditional one-step baseline by more than 10%.
1 Introduction
Search engines are key players in serving as the interface between users and web resources. Hence, they have started to take on the challenge of modelling user interests and enhancing the search experience. This is one of the main drivers of replacing classical document-keyword matching, a.k.a. the bag-of-words approach, with user-oriented strategies. Specifically, these changes are geared towards improving the precision, contextualization, and personalization of search results. To achieve this, it is vital to identify fundamental structures such as named entities (e.g., persons, locations and organizations) (Hu et al., 2009). Indeed, previous studies indicate that over 70% of all queries contain entities (Guo et al., 2009; Yin and Shah, 2010).
Search queries are composed of 2-3 words on average, providing little context and breaking the grammatical rules of natural language (Guo et al., 2009; Du et al., 2010). Thus, named entity recognizers designed for relatively lengthy, grammatically well-formed documents perform poorly on the task of Named Entity Recognition in Queries (NERQ).
At heart, the contribution of this work is a novel supervised approach to NERQ, trained with a large set of manually tagged queries and consisting of two steps: 1) a binary classification, where each query term is tagged as token/entity depending on whether or not it is part of a named entity; and 2) a categorization that takes advantage of these binary token/entity labels for assigning each term within the query to one of a pre-defined set of classes.
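The two steps just described can be sketched as follows. This is a rough illustration only: the two CRFs are replaced here by trivial dictionary-based stand-ins, and the lexicon and classification decisions are invented for the example; in the paper both steps are learned CRF models.

```python
# Illustrative sketch of the two-step NERQ pipeline. The dictionary-based
# stand-ins below are assumptions for demonstration, not the paper's models.

STEP1_LEXICON = {"paris", "yahoo"}  # hypothetical entity vocabulary

def step1_binary(terms):
    """Step 1: tag each query term as 'entity' or 'token'."""
    return ["entity" if t in STEP1_LEXICON else "token" for t in terms]

def step2_classify(terms, binary_labels):
    """Step 2: assign one of the 29 classes, taking the step-1 label
    as an additional feature (it may still be revised here)."""
    classes = []
    for term, label in zip(terms, binary_labels):
        if label == "token":
            classes.append("Token")               # non-entity class
        elif term == "yahoo":
            classes.append("Organization Name")   # invented decision rule
        else:
            classes.append("Place Name")          # invented decision rule
    return classes

def nerq_2s(query):
    terms = query.lower().split()
    return list(zip(terms, step2_classify(terms, step1_binary(terms))))

print(nerq_2s("weather yahoo paris"))
# → [('weather', 'Token'), ('yahoo', 'Organization Name'), ('paris', 'Place Name')]
```

The point of the sketch is the data flow: step 2 consumes step 1's binary labels as evidence rather than as hard constraints.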
2 Related Work
To the best of our knowledge, there have been only a few previous research efforts attempting to recognize named entities in search queries. This problem is relatively new; it was first introduced by Paşca (2007), whose weakly supervised method starts with an input class represented by a set of seeds, which are used to induce typical query contexts for the respective input category. These contexts are then used to acquire and select new candidate instances for the corresponding class.
In their pioneering work, Guo et al. (2009) focused on queries that contain only one named entity belonging to one of four classes (i.e., movie, game, book and song). As for the learning approach, they employed weakly supervised topic models using partially labeled seed named entities. These topic models were trained using query log data corresponding to 120 seed named entities (plus another 60 for testing) selected from three target web sites. Later, Jain and Pennacchiotti (2010) extended this approach to a completely unsupervised and class-independent method.
In another study, Du et al. (2010) tackled the lack of context in short queries by interpreting query sequences in the same search session as extra contextual information. They capitalized on a collection of 6,000 sessions containing only queries targeted at the car model domain. They trained Conditional Random Field (CRF) and topic models, showing that using search sessions improves the performance significantly. More recently, Alasiry et al. (2012a; 2012b) determined named entity boundaries by combining grammar annotation, query segmentation, and top-ranked snippets from search engine results in conjunction with a web n-gram model.
In contrast, we profit neither from seed named entities nor from web search results, but rather from a large manually annotated collection of about 80,000 open-domain queries. We consider search queries containing multiple named entities, and we do not rely on search sessions. Furthermore, our approach performs two labelling steps instead of a straightforward one-step labelling. The first step checks whether each query term is part of a named entity or not, while the second assigns each term to one out of a set of 29¹ classes by taking into account the outcome of the first step.
3 NERQ-2S
NERQ-2S is a two-step named entity recognizer for open-domain search queries. First, it differentiates named entity terms from other types of tokens (e.g., words and numbers) on the basis of a CRF² trained with manually annotated data. In the second step, NERQ-2S incorporates the output of this CRF into a new CRF as a feature. This second CRF assigns each term within the query to one out of 29 pre-defined categories. In essence, considering these automatically computed binary entity/token labels seeks to influence the second model so that the overall performance is improved. Given that the binary entity/token tags are only used as additional contextual evidence by the second CRF, these labels can be revised in the second step. NERQ-2S identifies 28 named entity classes that are prominent in open-domain search engine queries (see table 1). This set of categories was deliberately chosen as a means of enriching search results regarding general user interests, and thus aims at providing a substantially better overall user experience. In particular, named entities are normally utilized for devising the layout and the content of the result page of a search engine.
¹ In actuality, we considered 29 classes: 28 regard named entities and one class is for non-entities (token). For the sake of readability, from now on, we say indistinctly that the second step identifies 28 named entity classes or 29 classes.
² CRFsuite: http://www.chokkan.org/software/crfsuite
At both steps, NERQ-2S uses a CRF as a classifier together with a set of properties, which was determined separately for each classifier by executing a greedy feature selection algorithm (see next section). For both CRFs, this algorithm contemplated as candidates the 24 attributes explained in table 2. Additionally, in the case of the second CRF, the algorithm took into account the entity/token feature produced by the first CRF. Note that the features in table 2 are well known from other named entity recognition systems (Nadeau and Sekine, 2007).
4 Experiments
In all our experiments, we carried out a 10-fold cross-validation. As for data sets, we benefited from a collection comprising 82,413 queries, composed of 242,723 terms³. These queries were randomly extracted from the query log of a commercial search engine, and they are exclusively in English. In order to annotate our query collection, the queries were first tokenized, and then each term was manually tagged by an editorial team using the schema adopted in (Tjong Kim Sang and De Meulder, 2003).
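Under that CoNLL-2003-style scheme, an annotated query and its collapse to the step-1 binary labels might look as follows. The query, its tags, and the class name are invented for illustration; the paper does not publish sample annotations.

```python
# Hypothetical BIO-annotated query (invented example) and its collapse
# to the binary entity/token labels used by the first step.

query_tags = [
    ("cheap",  "O"),             # non-entity term
    ("hotels", "O"),             # non-entity term
    ("new",    "B-PlaceName"),   # beginning of a place-name entity
    ("york",   "I-PlaceName"),   # inside the same entity
]

def to_binary(tag):
    """Collapse a BIO class tag to the step-1 entity/token label."""
    return "token" if tag == "O" else "entity"

binary = [(term, to_binary(tag)) for term, tag in query_tags]
print(binary)
# → [('cheap', 'token'), ('hotels', 'token'), ('new', 'entity'), ('york', 'entity')]
```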
Attributes were selected by exploiting a greedy algorithm. This procedure starts with an empty bag of properties and, after each iteration, adds the one that performs best. In order to determine this feature, the procedure tests each non-selected attribute together with all the properties already in the bag. The algorithm stops when no non-selected feature enhances the performance.
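The greedy forward selection just described can be sketched as follows. The scoring function here is a toy stand-in (our assumption) for the cross-validated F1 of a CRF trained on the candidate feature set; its weights are invented to mimic the pattern in which features 11, 2 and 0 dominate.

```python
def greedy_select(candidates, score):
    """Greedy forward feature selection: start from an empty bag and, at
    each iteration, add the candidate that maximizes the score of the
    bag; stop when no candidate improves on the current best score."""
    bag, best = [], float("-inf")
    while True:
        gains = [(score(bag + [c]), c) for c in candidates if c not in bag]
        if not gains:
            break
        top_score, top_feat = max(gains)
        if top_score <= best:   # no remaining feature helps
            break
        bag.append(top_feat)
        best = top_score
    return bag, best

# Toy stand-in score: rewards features 11, 2 and 0, with a small
# per-feature penalty so that useless additions hurt.
WEIGHTS = {11: 0.5, 2: 0.1, 0: 0.05}

def toy_score(feats):
    return sum(WEIGHTS.get(f, 0.0) for f in feats) - 0.01 * len(feats)

selected, score = greedy_select(list(range(24)), toy_score)
print(selected)  # → [11, 2, 0]
```

With the toy score, the procedure picks 11 first, then 2, then 0, and stops once only zero-weight features remain, since adding any of them lowers the score.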
[Figure 1 (bar chart): Attributes selected by the greedy algorithm and their respective contribution (baseline); see table 2 for id-feature mappings. The features were chosen in the order 11, 10, 2, 1, 0, 4, 13, 3, 7, 17, 19, 21, 12, with the cumulative F(1)-score rising step by step through 0.519, 0.561, 0.599, 0.621, 0.634, 0.642, 0.648, 0.652, 0.656, 0.657, 0.657, 0.658, 0.659.]
³ Due to privacy laws, query logs cannot be made public.

ID  Name                   Example
0   Airline Code           AA, LA, JJ
1   Beverage               Cocktails, Beer
2   Brand Name             Bacardi, Apple
3   Business               Hotel, Newspaper
4   Cooking Method         Pressure Cooking
5   Cuisine                Mexican, German
6   Currency Name          Dollar, Euros, Pesos
7   Diet                   Vegan, Fat free
8   Disease and Condition  Cancer, Diabetic
9   Dish                   Ratatouille, Tiramisu
10  Domain                 forbes.com, lan.com
11  Drink                  Bloody Mary, Sangria
12  Email Address          john.doe@example.com
13  Event Name             Christmas, Super Bowl
14  File Name              msimn.exe, .htaccess
15  Food                   Sushi, Bread, Dessert
16  Food Ingredient        Honey, Avocado
17  Food Taste             Sweet, Cheesy
18  Horoscope Sign         Libra, Taurus
19  Measurement Name       Inches, Kilogram
20  Media Title            Age of Empires 2
21  Occasion               Festival, Ceremony
22  Organization Name      Yahoo, Café Soleil
23  Person Name            Marry Poppins
24  Phone Number           3153423595
25  Place Name             Chile, Berlin
26  Product                Camera, Cell phone
27  Treatment              Steroids, Surgery
28  Token (no NE-class)    how, to, image

Table 1: Named entity classes recognized by NERQ-2S.

As for a baseline, we used a traditional one-step approach grounded on a CRF enriched with 13 out of our 24 features (see table 2), which were chosen by running our greedy feature selection algorithm. Figure 1 shows the order in which these 13 features were chosen and their respective impact on the performance. Regarding these results, it is worth highlighting the following findings:
1. The first feature selected by the greedy algorithm models each term by its non-numerical characters (id=11 in table 2). This attribute helps to correctly tag 80.42% of the terms when they are modified (numbers removed).

2. The third chosen feature considers the value of the following word when tagging a term (id=2 in table 2). This attribute helps to correctly annotate 79.68%, 74.55% and 74.87% of tokens belonging to person, place and organization names, respectively.

3. Our figures also point to the relevance of the three word features (id=0, 1, 2 in table 2). These features were selected in a row, boosting the performance from F(1) = 0.561 to F(1) = 0.634, a 13.01% increase with respect to the previously selected properties.
In summary, the performance of the one-step baseline is F(1) = 0.659. In contrast, figure 2 highlights the 16 out of the 25 features utilized by the second phase of NERQ-2S. Note that the "new" bar indicates the token/entity attribute determined in the first step. Most importantly, NERQ-2S finished with an F(1) = 0.729, which means a 10.62% enhancement with respect to the one-step baseline. From these results, it is worth considering the following aspects:
1. In terms of features, 11 of the 13 attributes used by the one-step baseline were also exploited by NERQ-2S. Further, NERQ-2S profits from four additional properties that were also available to the one-step baseline.

2. The five most prominent properties selected by the baseline were also chosen by NERQ-2S, with just a slight change in order.

3. The "new" feature achieves an improvement of 23.51% (F(1) = 0.641) with respect to the previously selected property. The impact of the entity/token attribute can be measured by comparing it with the performance accomplished by the first five features selected by the baseline (F(1) = 0.634).
In light of these results, we can conclude that: a) adding the entity/token feature to the CRF is vital for boosting the performance, making a two-step approach a better solution than the traditional one-step approach; and b) this entity/token property is complementary to the list shown in table 2.
The confusion matrix for NERQ-2S shows that errors basically concern highly ambiguous terms. Some interesting misclassifications:

1. Overall, 17.38% of the terms belonging to place names were mistagged by NERQ-2S. Of these, 72.11% were perceived as part of organization names.

2. On the other hand, 17.27% of the terms corresponding to organization names were mislabelled by NERQ-2S. Here, 15.52% and 12.84% of these errors were due to the fact that these terms were seen as tokens and as parts of place names, respectively.
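This kind of per-class error analysis can be reproduced from a (gold class, predicted class) confusion matrix as follows; the counts below are invented toy numbers, not the paper's data.

```python
# Per-class mistag rates from a gold -> predicted confusion matrix.
# All counts are invented for illustration.
confusion = {
    "Place Name":        {"Place Name": 826, "Organization Name": 125, "Token": 49},
    "Organization Name": {"Organization Name": 827, "Token": 80, "Place Name": 93},
}

def mistag_rate(gold):
    """Fraction of gold-class terms assigned any other class."""
    row = confusion[gold]
    total = sum(row.values())
    return (total - row[gold]) / total

for gold in confusion:
    print(gold, round(100 * mistag_rate(gold), 2), "%")
```

The same rows also yield the error breakdowns quoted in the list above, e.g. the share of mistagged place-name terms that went to "Organization Name".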
ID  Feature                                       Example

Word features
0   Current term (t_i)                            abc123
1   Previous term (t_{i-1})                       before
2   Next term (t_{i+1})                           after

N-grams
3   Bigram of t_{i-1} and t_i                     before abc123
4   Bigram of t_i and t_{i+1}                     abc123 after

Pre- & postfix
5   1 leftmost character of t_i                   a
6   2 leftmost characters of t_i                  ab
7   3 leftmost characters of t_i                  abc
8   1 rightmost character of t_i                  3
9   2 rightmost characters of t_i                 23
10  3 rightmost characters of t_i                 123

Reductions
11  t_i without digits                            abc
12  t_i without letters                           123

Word shape
13  Shape of t_i ("a" represents letters,         aaa000
    "0" digits, "-" special characters)
14  Shape of t_i (same elements joined)           a0

Position & lengths
15  Position of t_i from left                     3
16  Position of t_i from right                    2
17  Character length of t_i                       6

Boolean
18  t_i is a number? (only digits)                false
19  t_i is a word? (only letters)                 false
20  t_i is a mixture of letters and digits?       true
21  t_i contains "."?                             false
22  t_i contains an apostrophe?                   false
23  t_i contains other special characters?        false

Table 2: List of used features. Examples are for the third term of the query "first before abc123 after".
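As a concrete illustration, a few of the features in table 2 can be computed as follows. This is a minimal sketch under our own naming; the actual feature extraction code of NERQ-2S is not published.

```python
import re

def word_shape(term):
    """Feature 13: letters -> 'a', digits -> '0', anything else -> '-'."""
    return "".join("a" if c.isalpha() else "0" if c.isdigit() else "-"
                   for c in term)

def joined_shape(term):
    """Feature 14: word shape with runs of equal symbols collapsed."""
    return re.sub(r"(.)\1+", r"\1", word_shape(term))

def features(term):
    """A subset of the table 2 attributes for a single term."""
    return {
        "no_digits":    re.sub(r"\d", "", term),        # feature 11
        "no_letters":   re.sub(r"[a-zA-Z]", "", term),  # feature 12
        "shape":        word_shape(term),               # feature 13
        "shape_joined": joined_shape(term),             # feature 14
        "prefix3":      term[:3],                       # feature 7
        "suffix3":      term[-3:],                      # feature 10
        "is_mixed":     (any(c.isalpha() for c in term)
                         and any(c.isdigit() for c in term)),  # feature 20
    }

print(features("abc123"))
```

For the term "abc123" this reproduces the examples in the table: shape "aaa000", joined shape "a0", digit-stripped form "abc", and so on.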
[Figure 2 (bar chart): Attributes selected by the greedy algorithm and their respective contribution (NERQ-2S); see table 2 for id-feature mappings. The word "new" denotes the binary token/entity attribute determined in the first step. The features were chosen in the order 11, new, 10, 2, 0, 1, 7, 4, 13, 3, 21, 14, 8, 16, 19, 18, with the cumulative F(1)-score rising from 0.519 (feature 11 alone) through 0.641 (adding "new"), 0.682, 0.698, 0.708, 0.716, 0.720, 0.723, 0.727, and a plateau at 0.728, to a final 0.729.]
Incidentally, NERQ-2S mislabelled 10.40% of the tokens (non-named-entity terms), while the one-step baseline mislabelled 17.57%. This difference signals the importance of a first step consisting of a specialized and efficient token/entity term annotator. With regard to the first step of NERQ-2S, nine out of the 24 properties were useful, and the first step finished with an F(1) = 0.8077. Of these nine attributes, eight correspond to the top eight features used by our one-step baseline, plus one extra attribute (id=20). Thus, the discriminative probabilistic model learned in this first step is more specialized for this task. That is to say, though the context of a term might be modelled similarly, the parameters of the CRF model are different.

The confusion matrix for this binary classifier shows that 11.44% of entity terms were mistagged as tokens, while 22.24% of tokens were mistagged as entity terms. This means that a higher percentage of errors comes from mislabelled tokens.
On a final note, as a means of quantifying the impact of the first step on NERQ-2S, we replaced the output given by the first CRF model with the manual binary token/entity annotations given by the editorial team. In other words, the "new" feature is now a manual input instead of an automatically computed property. By doing this, NERQ-2S increases its performance from F(1) = 0.729 to F(1) = 0.809, which is 10.97% better than NERQ-2S and 22.76% better than the one-step baseline. This corroborates that a two-step approach to NERQ is promising.
5 Conclusions and Further Work
This paper presents NERQ-2S, a two-step approach to the problem of recognizing named entities in search queries. In the first stage, NERQ-2S checks whether or not each query term belongs to a named entity, and in the second phase, it categorizes each token according to a set of pre-defined classes. These classes are aimed at enhancing the user experience with the search engine, in contrast to previous pre-defined categories.

Our results indicate that our two-step approach outperforms the typical one-step NERQ. Since our error analysis indicates that there is about 11% of potential global improvement to be gained by boosting the performance of the entity/token tagger, one research direction regards combining the output of distinct two-sided classifiers for improving the overall performance of NERQ-2S.
References

Areej Alasiry, Mark Levene, and Alexandra Poulovassilis. 2012a. Detecting candidate named entities in search queries. In SIGIR, pages 1049-1050.

Areej Alasiry, Mark Levene, and Alexandra Poulovassilis. 2012b. Extraction and evaluation of candidate named entities in search engine queries. In WISE, pages 483-496.

Junwu Du, Zhimin Zhang, Jun Yan, Yan Cui, and Zheng Chen. 2010. Using search session context for named entity recognition in query. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '10.

Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '09, page 267, New York, New York, USA. ACM Press.

Jian Hu, Gang Wang, Fred Lochovsky, Jian-Tao Sun, and Zheng Chen. 2009. Understanding user's query intent with Wikipedia. In Proceedings of WWW-09.

A. Jain and Marco Pennacchiotti. 2010. Open entity extraction from web search query logs. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 510-518.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Linguisticae Investigationes, 30(1):3-26, January. Publisher: John Benjamins Publishing Company.

Marius Paşca. 2007. Weakly-supervised discovery of named entities using web search queries. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management - CIKM '07, page 683, New York, New York, USA. ACM Press.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL '03, pages 142-147, Stroudsburg, PA, USA. Association for Computational Linguistics.

Xiaoxin Yin and Sarthak Shah. 2010. Building taxonomy of web search intents for name entity queries. In Proceedings of WWW-2010.
833
... Most research concerning entity-based query understanding focuses on NamedEntity Recognition [41] (the task of finding what terms are mentions of NamedEntities, without linking them to the entity), possibly associated to query intent discovery [32] or query classification into pre-defined classes [16,21,35]. Some work has also focused on linguistic analysis of queries, for example, by POS tagging terms or tagging them with a limited number of classes and other linguistic structures [1,2], or assigning a coarse-grained purpose to each segment [30]. ...
... As a source of information, these works may either use knowledge bases or information derived from web search such as query logs (see, e.g., Reference [28]), click through information [33], search sessions [14], top-k snippets from search engines [1], web phrase DBs [2,23], or large manually annotated collections of open-domain queries to extract robust frequency or mutualinformation features and contexts [16]. ...
... Entities drawn from Source 3, our largest source of candidates, are associated with a set of features (9)(10)(11)(12)(13)(14)(15)(16)(17)(18)(19)(20)(21)(22)(23)(24) relative to the process of snippet annotation performed by WAT. Feature freq (how many snippets mention the entity) is an obvious indicator of an entity's correctness. ...
Article
We study the problem of linking the terms of a web-search query to a semantic representation given by the set of entities (a.k.a. concepts) mentioned in it. We introduce SMAPH, a system that performs this task using the information coming from a web search engine, an approach we call “piggybacking.” We employ search engines to alleviate the noise and irregularities that characterize the language of queries. Snippets returned as search results also provide a context for the query that makes it easier to disambiguate the meaning of the query. From the search results, SMAPH builds a set of candidate entities with high coverage. This set is filtered by linking back the candidate entities to the terms occurring in the input query, ensuring high precision. A greedy disambiguation algorithm performs this filtering; it maximizes the coherence of the solution by iteratively discovering the pertinent entities mentioned in the query. We propose three versions of SMAPH that outperform state-of-the-art solutions on the known benchmarks and on the GERDAQ dataset, a novel dataset that we have built specifically for this problem via crowd-sourcing and that we make publicly available.
... There is little prior work on entity annotators for queries, mostly concerned with the detection of NEs [30], possibly associated to intent [25] or pre-defined classes [27,16,13], POS tagging or tagging with a limited number of classes and other linguistic structures [1,2], abbreviation disambiguation [41], assigning a coarse-grained purpose to each segment [23]. Some of them operate just on the query text (see e.g., [21]), others use query logs [1], click through information [26], search sessions [11], top-k snippets and web phrase DBs [2,18], and large manually annotated collections of open-domain queries to extract robust frequency or mutual-information features and contexts [13]. ...
... There is little prior work on entity annotators for queries, mostly concerned with the detection of NEs [30], possibly associated to intent [25] or pre-defined classes [27,16,13], POS tagging or tagging with a limited number of classes and other linguistic structures [1,2], abbreviation disambiguation [41], assigning a coarse-grained purpose to each segment [23]. Some of them operate just on the query text (see e.g., [21]), others use query logs [1], click through information [26], search sessions [11], top-k snippets and web phrase DBs [2,18], and large manually annotated collections of open-domain queries to extract robust frequency or mutual-information features and contexts [13]. ...
Conference Paper
In this paper we study the problem of linking open-domain web-search queries towards entities drawn from the full entity inventory of Wikipedia articles. We introduce SMAPH-2, a second-order approach that, by piggybacking on a web search engine, alleviates the noise and irregularities that characterize the language of queries and puts queries in a larger context in which it is easier to make sense of them. The key algorithmic idea underlying SMAPH-2 is to first discover a candidate set of entities and then link-back those entities to their mentions occurring in the input query. This allows us to confine the possible concepts pertinent to the query to only the ones really mentioned in it. The link-back is implemented via a collective disambiguation step based upon a supervised ranking model that makes one joint prediction for the annotation of the complete query optimizing directly the F1 measure. We evaluate both known features, such as word embeddings and semantic relatedness among entities, and several novel features such as an approximate distance between mentions and entities (which can handle spelling errors). We demonstrate that SMAPH-2 achieves state-of-the-art performance on the ERD@SIGIR2014 benchmark. We also publish GERDAQ (General Entity Recognition, Disambiguation and Annotation in Queries), a novel, public dataset built specifically for web-query entity linking via a crowdsourcing effort. SMAPH-2 outperforms the benchmarks by comparable margins also on GERDAQ.
... A recognition method with high quality can directly improve the follow-up processing results of Web data management products. NER research shows great successes in various domains and becomes a hot topic Eiselt and Figueroa, 2013), such as social media (Vavliakis et al., 2013;Yao and Sun,2016), language texts (Karaa and Slimani, 2017),biomedicine (Song et al., 2016;Amith et al.,2017), and so on. As we know, texts of different domains may vary from features, writing styles and structures. ...
Article
Full-text available
Recently, neural networks have shown promising results for named entity recognition(NER), which needs a number of labeled data to for model training. When meeting a new domain (target domain) for NER, there is no or a few labeled data, which makes domain NER much more difficult. As NER has been researched for a long time, some similar domain already has well labeled data(source domain). Therefore, in this paper, we focus on domain NER by studying how to utilize the labeled data from such similar source domain for the new target domain. We design a kernel function based instance transfer strategy by getting similar labeled sentences from a source domain. Moreover, we propose an enhanced recurrent neural network (ERNN) by adding an additional layer that combines the source domain labeled data into traditional RNN structure. Comprehensive experiments are conducted on two datasets. The comparison results among HMM, CRF and RNN show that RNN performs better than others. When there is no labeled data in domain target, compared to directly using the source domain labeled data without selecting transferred instances, our enhanced RNN approach gets improvement from 0.8052 to 0.9328 in terms of F1 measure.
... Yahoo! Research proposed a two-step process using Conditional Random Fields (CRFs) to first identify named entities in queries and then assign them to one out of 29 predefined categories [19]. A CRF-based approach is also proposed by Expedia to find named entities in travel-related search queries [14]. ...
Article
Search engines are still the most common way of finding information on the Web. However, they are largely unable to provide satisfactory answers to time- and location-specific queries. Such queries can best and often only be answered by humans that are currently on-site. Although online platforms for community question answering are very popular, very few exceptions consider the notion of users’ current physical locations. In this article, we present CloseUp, our prototype for the seamless integration of community-driven live search into a Google-like search experience. Our efforts focus on overcoming the defining differences between traditional Web search and community question answering, namely the formulation of search requests (keyword-based queries vs. well-formed questions) and the expected response times (milliseconds vs. minutes/hours). To this end, the system features a deep learning pipeline to analyze submitted queries and translate relevant queries into questions. Searching users can submit suggested questions to a community of mobile users. CloseUp provides a stand-alone mobile application for submitting, browsing, and replying to questions. Replies from mobile users are presented as live results in the search interface. Using a field study, we evaluated the feasibility and practicability of our approach.
... They trained CRF models [9], showing that using context from search sessions improves the performance significantly. More recently, authors in [11] proposed NERQ-2S I, a two-step named entity recognizer for open-domain search queries. The first step classifies each query term as token or part of a named entity based on a CRF. ...
Chapter
Full-text available
Semantic understanding of web queries is a challenging problem as web queries are short, noisy and usually do not observe the grammar of a written language. In this paper, we specifically study the user web search queries with local intent on Bing. Local intent queries deal with searching for local businesses and services in a location. Hence, local query parsing translates into the classical problem of Named Entity Recognition (NER) in NLP. State-of-the-art NER systems rely heavily on hand-crafted features and domain-specific knowledge to effectively learn from the small, supervised training corpora that is available. In this paper, we use deep learnt neural model that relies solely on features extracted from word embeddings learnt in an unsupervised way, using search logs. We propose a novel technique for generating domain specific embeddings and show that they significantly improve the performance of existing models for the NER task. Our model outperforms the existing CRF based parser currently used in production.
... NER has been recognized as a fundamental problem in query processing by Guo et al. (2009), and many works since (e.g. (Alasiry et al., 2012;Eiselt and Figueroa, 2013;Zhai et al., 2016)) explored various models and features for the task. Differently from those works, our goal is to design a BiLSTM model that can be easily integrated with modern BiLSTM parsers. ...
... The lack of context in short queries (i.e. tweets), due to the character restriction, makes the task of recognising entities particularly difficult for full-text offthe-shelf Named Entity Recognition (NER) ( Eiselt & Figueroa, 2013 ). We have utilised NER by selecting the 20 most frequent proper nouns from each of the FDB company sub-forums. ...
Article
Investors utilise social media such as Twitter as a means of sharing news surrounding financials stocks listed on international stock exchanges. Company ticker symbols are used to uniquely identify companies listed on stock exchanges and can be embedded within tweets to create clickable hyperlinks referred to as cashtags, allowing investors to associate their tweets with specific companies. The main limitation is that identical ticker symbols are present on exchanges all over the world, and when searching for such cashtags on Twitter, a stream of tweets is returned which match any company in which the cashtag refers to - we refer to this as a cashtag collision. The presence of colliding cashtags could sow confusion for investors seeking news regarding a specific company. A resolution to this issue would benefit investors who rely on the speediness of tweets for financial information, saving them precious time. We propose a methodology to resolve this problem which combines Natural Language Processing and Data Fusion to construct company-specific corpora to aid in the detection and resolution of colliding cashtags, so that tweets can be classified as being related to a specific stock exchange or not. Supervised machine learning classifiers are trained twice on each tweet –once on a count vectorisation of the tweet text, and again with the assistance of features contained in the company-specific corpora. We validate the cashtag collision methodology by carrying out an experiment involving companies listed on the London Stock Exchange. Results show that several machine learning classifiers benefit from the use of the custom corpora, yielding higher classification accuracy in the prediction and resolution of colliding cashtags.
... The lack of context in short queries (i.e. tweets), due to the character restriction, makes the task of recognising entities particularly difficult for full-text offthe-shelf Named Entity Recognition (NER) ( Eiselt & Figueroa, 2013 ). We have utilised NER by selecting the 20 most frequent proper nouns from each of the FDB company sub-forums. ...
Chapter
The dawn of big data has seen the volume, variety, and velocity of data sources increase dramatically. Enormous amounts of structured, semi-structured and unstructured heterogeneous data can be garnered at a rapid rate, making analysis of such big data a herculean task. This has never been truer for data relating to financial stock markets, the biggest challenge being the 7Vs of big data which relate to the collection, pre-processing, storage and real-time processing of such huge quantities of disparate data sources. Data fusion techniques have been adopted in a wide number of fields to cope with such vast amounts of heterogeneous data from multiple sources and fuse them together in order to produce a more comprehensive view of the data and its underlying relationships. Research into the fusing of heterogeneous financial data is scant within the literature, with existing work only taking into consideration the fusing of text-based financial documents. The lack of integration between financial stock market data, social media comments, financial discussion board posts and broker agencies means that the benefits of data fusion are not being realised to their full potential. This paper proposes a novel data fusion model, inspired by the data fusion model introduced by the Joint Directors of Laboratories, for the fusing of disparate data sources relating to financial stocks. Data with a diverse set of features from different data sources will supplement each other in order to obtain a Smart Data Layer, which will assist in scenarios such as irregularity detection and prediction of stock prices. KeywordsBig dataData fusionHeterogeneous financial data
... NER research has shown great success in various domains and has become a hot topic (Eiselt and Figueroa, 2013), such as social media (Vavliakis et al., 2013; Yao and Sun, 2016), language texts (Karaa and Slimani, 2017), biomedicine (Amith et al., 2017), and so on. As we know, texts from different domains may vary in features, writing styles and structures. ...
Preprint
Recently, neural networks have shown promising results for named entity recognition (NER), but they require a substantial amount of labeled data for model training. When a new domain (the target domain) is encountered for NER, little or no labeled data is available, which makes domain NER much more difficult. As NER has been researched for a long time, similar domains often already have well-labelled data (the source domain). In this paper, we therefore focus on domain NER by studying how to utilize the labelled data from such a similar source domain for the new target domain. We design a kernel-function-based instance transfer strategy that selects similar labelled sentences from a source domain. Moreover, we propose an enhanced recurrent neural network (ERNN) that adds a layer combining the source-domain labelled data into the traditional RNN structure. Comprehensive experiments are conducted on two datasets. The comparison among HMM, CRF and RNN shows that RNN performs better than the others. When there is no labelled data in the target domain, our enhanced RNN approach improves the F1 measure from 0.8052 to 0.9328, compared to directly using the source-domain labelled data without selecting transferred instances.
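The instance-transfer idea in this abstract, selecting source-domain sentences similar to the target domain, can be illustrated with a simple cosine-similarity stand-in for the kernel function. The sentences and the helper name `select_transfer_instances` are hypothetical, not from the paper:

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(n * b[w] for w, n in a.items())
    na = math.sqrt(sum(n * n for n in a.values()))
    nb = math.sqrt(sum(n * n for n in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_transfer_instances(source_sents, target_sents, k):
    """Keep the k source-domain labelled sentences most similar to any
    target-domain sentence (a stand-in for the paper's kernel function)."""
    targets = [Counter(s.lower().split()) for s in target_sents]
    scored = [(max(cosine(Counter(s.lower().split()), t) for t in targets), s)
              for s in source_sents]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [s for _, s in scored[:k]]

source = ["barcelona beat madrid in the league",
          "the senate passed the bill",
          "apple released a new phone"]
target = ["liverpool beat chelsea in the cup"]
print(select_transfer_instances(source, target, k=1))
```

The selected sentences would then be added to the target-domain training set, as in the transfer strategy the abstract describes.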
Article
Full-text available
Automatically categorizing user intent behind web queries is a key issue not only for improving information retrieval tasks but also for designing tailored displays based on the underlying intention. In this article, a multiview learning method is proposed to recognize the user intent behind web searches.
Conference Paper
Full-text available
A significant portion of web search queries are named entity queries. The major search engines have been exploring various ways to provide better user experiences for named entity queries, such as showing "search tasks" (Bing search) and showing direct answers (Yahoo!, Kosmix). In order to provide the search tasks or direct answers that satisfy the most popular user intents, we need to capture these intents, together with the relationships between them. In this paper we propose an approach for building a hierarchical taxonomy of the generic search intents for a class of named entities (e.g., musicians or cities). The proposed approach can find phrases representing generic intents in user queries, and organize these phrases into a tree, so that phrases indicating equivalent or similar meanings fall on the same node, and the parent-child relationships of tree nodes represent the relationships between search intents and their sub-intents. Three different methods are proposed for tree building, based on directed maximum spanning trees, hierarchical agglomerative clustering, and the pachinko allocation model. Our approaches are purely based on search logs, and do not utilize any existing taxonomies such as Wikipedia. Evaluation by human judges (via Mechanical Turk) shows that our approaches can build trees of phrases that capture the relationships between important search intents.
Conference Paper
Full-text available
Understanding the intent behind a user's query can help a search engine automatically route the query to corresponding vertical search engines to obtain particularly relevant contents, thus greatly improving user satisfaction. There are three major challenges to the query intent classification problem: (1) intent representation; (2) domain coverage; and (3) semantic interpretation. Current approaches to predicting the user's intent mainly utilize machine learning techniques. However, it is difficult, and often requires much human effort, to meet all these challenges with statistical machine learning approaches. In this paper, we propose a general methodology for the problem of query intent classification. With very little human effort, our method can discover large quantities of intent concepts by leveraging Wikipedia, one of the best human knowledge bases. The Wikipedia concepts are used as the intent representation space; thus, each intent domain is represented as a set of Wikipedia articles and categories. The intent of any input query is identified by mapping the query into the Wikipedia representation space. Compared with previous approaches, our proposed method achieves much better coverage when classifying queries in an intent domain, even though the number of seed intent examples is very small. Moreover, the method is very general and can be easily applied to various intent domains. We demonstrate the effectiveness of this method in three different applications, i.e., travel, job, and person name. In each of the three cases, only a couple of seed intent queries are provided. We perform quantitative evaluations in comparison with two baseline methods, and the experimental results show that our method significantly outperforms other methods in each intent domain.
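The mapping step this abstract describes, representing each intent domain as a set of Wikipedia concepts and projecting the query into that space, can be sketched as a simple term-overlap classifier. The concept sets below are hypothetical stand-ins for real Wikipedia article and category titles:

```python
# Hypothetical intent domains, each represented as a set of concept
# terms standing in for Wikipedia article and category titles.
INTENT_CONCEPTS = {
    "travel": {"flight", "hotel", "airline", "airport", "visa", "itinerary"},
    "job": {"salary", "resume", "vacancy", "recruiter", "interview", "career"},
    "person": {"biography", "born", "actor", "politician", "singer"},
}

def classify_intent(query: str) -> str:
    """Map a query into the concept space and pick the domain with the
    largest term overlap; fall back to 'other' when nothing matches."""
    terms = set(query.lower().split())
    best, best_score = "other", 0
    for domain, concepts in INTENT_CONCEPTS.items():
        score = len(terms & concepts)
        if score > best_score:
            best, best_score = domain, score
    return best

print(classify_intent("cheap flight and hotel to tokyo"))
```

The paper's method is far richer (it grows the concept sets from seed queries via Wikipedia's link and category structure); the sketch only shows the representation-space idea.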
Article
Full-text available
The term "Named Entity", now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim, 1996). At that time, MUC was focusing on Information Extraction (IE) tasks, where structured information about company activities and defense-related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called "Named Entity Recognition and Classification (NERC)".
Conference Paper
Named Entity Recognition (NER) has recently been applied to search queries, in order to better understand their semantics. We present a novel method for detecting candidate named entities (NEs) using grammar annotation and query segmentation with the aid of top-n snippets from search engine results, and a web n-gram model to accurately identify NE boundaries. We then evaluate this method automatically using DBpedia as a rich data source of NEs, with the aid of a small representative random sample that is manually annotated. Finally, an analysis of the types of named entities that often occur in a query log is conducted, from which a search query driven named entity taxonomy is presented.
Article
The information extraction task of Named Entity Recognition (NER) has recently been applied to search engine queries, in order to better understand their semantics. Here we concentrate on the task prior to the classification of the named entities (NEs) into a set of categories: the problem of detecting candidate NEs via the subtask of query segmentation. We present a novel method for detecting candidate NEs using grammar annotation and query segmentation, with the aid of top-n snippets from search engine results and a web n-gram model, to accurately identify NE boundaries. The proposed method addresses the problem of accurately setting the boundaries of NEs and the detection of multiple NEs in queries.
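The n-gram-based boundary detection in the two abstracts above can be illustrated with a greedy segmenter that glues adjacent query terms whose association under an n-gram model is high. The counts, the threshold, and the PMI-style score are hypothetical simplifications of a real web n-gram model:

```python
import math

# Toy counts standing in for a large web n-gram model (hypothetical numbers).
TOTAL = 1_000_000
UNIGRAMS = {"new": 1000, "york": 300, "pizza": 500, "best": 800}
BIGRAMS = {("new", "york"): 280}

def connexity(w1, w2):
    """PMI-style association score; high values suggest the two terms
    belong to the same segment (a candidate NE such as 'new york')."""
    b = BIGRAMS.get((w1, w2), 0)
    if b == 0:
        return float("-inf")
    return math.log(b * TOTAL / (UNIGRAMS[w1] * UNIGRAMS[w2]))

def segment(query, threshold=0.0):
    """Greedy left-to-right segmentation: glue adjacent terms whose
    association exceeds the threshold, split otherwise."""
    tokens = query.lower().split()
    segments, current = [], [tokens[0]]
    for prev, nxt in zip(tokens, tokens[1:]):
        if connexity(prev, nxt) > threshold:
            current.append(nxt)
        else:
            segments.append(" ".join(current))
            current = [nxt]
    segments.append(" ".join(current))
    return segments

print(segment("best new york pizza"))
```

Each multi-word segment then becomes a candidate NE to be classified; the papers additionally exploit grammar annotation and top-n result snippets, which the sketch omits.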
Conference Paper
A seed-based framework for textual information extraction allows for weakly supervised extraction of named entities from anonymized Web search queries. The extraction is guided by a small set of seed named entities, without any need for handcrafted extraction patterns or domain-specific knowledge, allowing for the acquisition of named entities pertaining to various classes of interest to Web search users. Inherently noisy search queries are shown to be a highly valuable, albeit little explored, resource for Web-based named entity discovery.
Conference Paper
Recently, the problem of Named Entity Recognition in Query (NERQ) has been attracting increasing attention in the field of information retrieval. However, the lack of context information in short queries makes some classical named entity recognition (NER) algorithms fail. In this paper, we propose to utilize the search-session information preceding a query as its context to address this limitation. We propose to improve two classical NER solutions by utilizing the search-session context: a Conditional Random Field (CRF) based solution and a Topic Model based solution. In both approaches, the relationship between the current focused query and previous queries in the same session is used to extract novel context-aware features. Experimental results on real user search-session data show that NERQ algorithms using search-session context perform significantly better than algorithms using only the information in the short queries.
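The session-context idea can be made concrete with a tiny feature extractor that relates the focused query to earlier queries in the same session. The feature names and the example session are illustrative, not the paper's actual feature set:

```python
def session_context_features(session, focus_index):
    """Context-aware features for the focused query, drawn from the
    earlier queries in the same search session (names are illustrative)."""
    focus_terms = set(session[focus_index].lower().split())
    previous_terms = {t for q in session[:focus_index]
                      for t in q.lower().split()}
    return {
        # Terms the user repeats across queries often belong to the entity.
        "repeated_terms": sorted(focus_terms & previous_terms),
        # Terms new to this query are often intent words ("cast", "lyrics").
        "novel_terms": sorted(focus_terms - previous_terms),
        "has_context": focus_index > 0,
    }

session = ["harry potter", "harry potter cast"]
print(session_context_features(session, 1))
```

In a CRF-based solution such features would be attached per token rather than per query; the sketch only shows where the context signal comes from.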
Conference Paper
This paper addresses the problem of Named Entity Recognition in Query (NERQ), which involves detection of the named entity in a given query and classification of the named entity into predefined classes. NERQ is potentially useful in many applications in web search. The paper proposes taking a probabilistic approach to the task using query log data and Latent Dirichlet Allocation. We consider contexts of a named entity (i.e., the remainders of queries after the named entity is removed) as words of a document, and classes of the named entity as topics. The topic model is constructed by a novel and general learning method referred to as WS-LDA (Weakly Supervised Latent Dirichlet Allocation), which employs weakly supervised learning (rather than unsupervised learning) using partially labeled seed entities. Experimental results show that the proposed method based on WS-LDA can accurately perform NERQ, and outperform the baseline methods.
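The context-extraction step in this abstract, removing the entity from each query and treating the remainder as the words of its pseudo-document, can be sketched directly. The queries and the seed entity are hypothetical examples:

```python
def entity_contexts(queries, entity):
    """Replace the entity with '#' in each query that contains it; each
    remainder acts as one 'word' of the entity's context document, which
    a topic model (WS-LDA in the paper) then associates with NE classes."""
    contexts = []
    for q in queries:
        q = q.lower()
        if entity in q:
            contexts.append(q.replace(entity, "#").strip())
    return contexts

queries = ["Harry Potter walkthrough", "buy Harry Potter book",
           "jingle bells lyrics"]
print(entity_contexts(queries, "harry potter"))
```

Contexts such as "# walkthrough" would pull the entity toward a Game class, while "# book" pulls it toward Book, which is exactly the ambiguity the topic model resolves probabilistically.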
Conference Paper
In this paper we propose a completely unsupervised method for open-domain entity extraction and clustering over query logs. The underlying hypothesis is that classes defined by mining search user activity may significantly differ from those typically considered over web documents, in that they better model the user space, i.e. users' perception and interests. We show that our method outperforms state of the art (semi-)supervised systems based either on web documents or on query logs (16% gain on the clustering task). We also report evidence that our method successfully supports a real world application, namely keyword generation for sponsored search.
Article
We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.
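CoNLL-style evaluation scores systems at the entity level: a prediction counts only when both the span boundaries and the type match exactly. A minimal sketch over BIO tag sequences (assuming well-formed tags; orphan I- tags are simply ignored, unlike the official conlleval script):

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence;
    end is exclusive."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a final entity
        inside = etype is not None and tag == "I-" + etype
        if not inside and etype is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold_tags, pred_tags):
    """Entity-level F1: a span is correct only on an exact match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
print(entity_f1(gold, pred))
```

Here the PER span matches exactly but the LOC entity is missed, giving precision 1.0, recall 0.5, and F1 of 2/3, the same scheme used to rank the CoNLL-2003 systems.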