Do It Yourself (DIY) Jeopardy Question Answering System
André Freitas and Edward Curry
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway
1 Motivation
The evolution and maturity of semantic technologies, techniques and frameworks
are bringing functionalities which were once considered academic or prototypical
into real-life applications. Products such as IBM Watson [1] and Siri are
examples of applications built on state-of-the-art semantic technologies. These
systems synthesize the functionalities which are available for general applications
today, such as natural language search and queries over large-scale data, semantic
flexibility, and integration between structured and unstructured resources. The
success of these projects in demonstrating the potential of existing technologies
lies in the fact that they bring together, in a single system, approaches from
Natural Language Processing (NLP), the Semantic Web (SW), Information
Retrieval (IR) and Databases.
This work demonstrates Treo, a framework which converges elements from
NLP, IR, SW and Databases to create a semantic search engine and question
answering (QA) system for heterogeneous data. Jeopardy and Question Answering
queries over open-domain structured and unstructured data are used to demonstrate
the approach. In this work, Treo is extended to cope with unstructured
text in addition to structured data. The setup of the framework is done in three
steps and can be adapted to other datasets in a simple DIY process.
2 Treo: Querying Structured & Unstructured Data
Treo supports free natural language queries over both structured and unstructured
data. To enable semantic flexibility and vocabulary independence in the
query process, a principled distributional-compositional semantic model is used
to build a distributional structured vector space model (τ-Space) [2]. Distributional
semantics focuses on the automatic construction of a semantic model
based on the statistical distribution of co-occurring words in large-scale corpora.
The distributional semantics component of the model supports a semantic
approximation between query and dataset terms: operations in the τ-Space are
mapped to semantic relatedness operations using the distributional model as a
commonsense knowledge base [2]. The automatic creation of distributional semantic
models supports the transportability of the approach to other datasets
and languages, without requiring the manual effort of creating ontologies (Treo
does not rely on ontology-based reasoning for semantic approximation).
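As an illustration of this kind of semantic approximation, the following is a minimal sketch of distributional relatedness in the style of ESA, assuming a toy corpus and simple term-frequency concept vectors; none of the names or weighting choices below come from the Treo implementation.

```python
import math

# Toy "reference corpus": each entry plays the role of an ESA concept.
# (Illustrative stand-in for the Wikipedia corpus; not the one used by Treo.)
CORPUS = {
    "Chemist": "a chemist studies chemistry compounds and reactions",
    "Politician": "a politician holds public office and wins elections",
    "Margaret Thatcher": "thatcher was a politician and trained as a chemist",
}

def concept_vector(term):
    """Represent a term by its term-frequency weight in each 'concept'."""
    vec = {}
    for concept, text in CORPUS.items():
        tf = text.lower().split().count(term.lower())
        if tf:
            vec[concept] = tf
    return vec

def relatedness(term_a, term_b):
    """Cosine similarity between two concept vectors (semantic approximation)."""
    va, vb = concept_vector(term_a), concept_vector(term_b)
    dot = sum(w * vb.get(c, 0) for c, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(relatedness("chemist", "thatcher"))   # > 0: related through a shared concept
    print(relatedness("chemist", "elections"))  # 0: no shared concept in the toy corpus
```

A production-scale model would use TF-IDF-weighted vectors over a full Wikipedia dump rather than raw term frequencies, but the geometric intuition is the same.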
In addition to queries over structured data, this work extends the query
mechanism for searching entities in unstructured text. Both structured and
unstructured data are linked in an entity-centric semantic index (Figure 1 (B)),
as illustrated by the sketch below.
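The sketch shows one plausible in-memory shape for such an entity record, with structured facts and text segments attached to the same entity; the field names are illustrative assumptions rather than the actual Treo index layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EntityRecord:
    """One entry of a hypothetical entity-centric index (illustration only)."""
    uri: str                                                      # e.g. ":Alias_(TV_Series)"
    facts: List[Tuple[str, str]] = field(default_factory=list)   # (attribute, value) pairs
    types: List[str] = field(default_factory=list)               # e.g. YAGO classes
    sentences: List[str] = field(default_factory=list)           # linked text segments

# Example mirroring Figure 1 (B): DBpedia/YAGO facts and Wikipedia sentences
# linked to the same entity.
alias = EntityRecord(
    uri=":Alias_(TV_Series)",
    facts=[(":creator", ":J._J._Abrams"), (":starring", ":Jennifer_Garner")],
    types=[":2001AmericanTelevisionSeriesDebuts"],
    sentences=["It stars Jennifer Garner as Sydney Bristow, a CIA agent."],
)
```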
The elements of the query processing approach are depicted in Figure 1 (A).
Two different query processing strategies are used:
- Query processing over structured data: In the query pre-processing phase,
the natural language query is analyzed by the Interpreter component, where a
set of query triple patterns and features is detected in the user query. The
second phase consists of the vocabulary-independent query processing approach,
which defines a sequence of search and data transformation operations over the
structured data graph embedded in the τ-Space [2], targeting the maximization
of the semantic matching with the query. The Query Planner generates the
sequence of semantic search, navigation and transformation operations over the
graph data, which defines the query processing plan, based on the set of query
features determined in the pre-processing phase. The third phase consists of the
execution of the query processing plan operations over the τ-Space index (see
the first sketch after these two strategies).
- Query processing over structured & unstructured data: In case the
query is not addressed by the available structured data, the query can be processed
against both structured data and unstructured text in the entity-centric
index. The query pre-processing for this query type consists of the detection of
the query focus, through POS-tag-based rules, and of the detection and resolution
of named entities in the query. The query plan consists of the composition of
keyword-search operations over the text segments associated with entities,
distributional search operations over structured data, and keyword search over
associated entities. A ranking function weights the results of all operations,
also taking into account the cardinality of each entity (number of associated
entities, facts and text segments). The initial top-20 entity results are then
re-ranked based on the distributional semantic relatedness scores between the
query focus phrase and the associated entity types (see the second sketch below).
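A rough sketch of the three phases for structured data is given below; the parsing rules, the plan format and the operation names (search_entity, navigate, rank_by_relatedness) are assumptions made for this illustration, not the operators defined in [2].

```python
from typing import Dict, List, Tuple

Plan = List[Tuple[str, str]]  # a query plan as (operation, argument) steps

def interpret(query: str) -> Dict:
    """Pre-processing: detect a triple-like pattern and query features (toy rules)."""
    tokens = query.rstrip("?").split()
    # e.g. "Was Margaret Thatcher a chemist?" -> pivot entity + property to check
    return {"entity": "Margaret Thatcher", "target": tokens[-1], "answer_type": "yes/no"}

def plan(parsed: Dict) -> Plan:
    """Query Planner: choose search/navigation/transformation operations
    based on the features found during pre-processing."""
    return [
        ("search_entity", parsed["entity"]),        # locate the pivot entity in the index
        ("navigate", "outgoing_properties"),        # expand its graph neighbourhood
        ("rank_by_relatedness", parsed["target"]),  # distributional matching step
    ]

class StubIndex:
    """Minimal stand-in for the structured-data index (illustration only)."""
    def run(self, op, arg, current):
        print(f"executing {op}({arg}) over {len(current)} intermediate results")
        return current  # a real index would return matched triples here

def execute(plan_steps: Plan, index) -> List:
    """Execute the query processing plan over the index."""
    results: List = []
    for op, arg in plan_steps:
        results = index.run(op, arg, results)
    return results

if __name__ == "__main__":
    execute(plan(interpret("Was Margaret Thatcher a chemist?")), StubIndex())
```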
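For the combined strategy, the second sketch shows one way the scoring and re-ranking described above could be composed; the 0.5/0.3/0.2 weights and the entity dictionary layout are invented for illustration and are not specified in the paper.

```python
import math

def score_entity(entity, query_terms):
    """Initial score: keyword hits in linked text segments, hits over facts,
    and a cardinality term (weights are arbitrary illustrative values)."""
    text = " ".join(entity["sentences"]).lower()
    text_hits = sum(text.count(t.lower()) for t in query_terms)
    fact_hits = sum(
        any(t.lower() in (attr + " " + val).lower() for t in query_terms)
        for attr, val in entity["facts"]
    )
    cardinality = len(entity["facts"]) + len(entity["sentences"])
    return 0.5 * text_hits + 0.3 * fact_hits + 0.2 * math.log(1 + cardinality)

def rerank_top(entities, query_focus, relatedness, k=20):
    """Re-rank the initial top-k entities by the distributional relatedness
    between the query focus phrase and each entity's types."""
    top = sorted(entities, key=lambda e: e["score"], reverse=True)[:k]
    for e in top:
        e["score"] += max((relatedness(query_focus, t) for t in e["types"]), default=0.0)
    return sorted(top, key=lambda e: e["score"], reverse=True)

if __name__ == "__main__":
    entities = [{
        "uri": ":Alias_(TV_Series)",
        "facts": [(":starring", ":Jennifer_Garner")],
        "types": ["TelevisionSeries"],
        "sentences": ["It stars Jennifer Garner as Sydney Bristow, a CIA agent."],
    }]
    for e in entities:
        e["score"] = score_entity(e, ["Jennifer", "Garner", "CIA"])
    # A real relatedness function (e.g. the ESA-style sketch above) would go here.
    print(rerank_top(entities, "Jennifer Garner show", lambda a, b: 0.0))
```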
3 DIY Setup Process
The setup of the Treo platform for a new dataset consists of the creation of a
semantic index for both structured and unstructured data, which requires three
steps:
1. Construction of the distributional semantic model: Consists of using a
large-scale reference corpus to build the distributional semantic reference
model [2]. In this demonstration, Wikipedia 2006 is used as the reference
corpus and Explicit Semantic Analysis (ESA) is the distributional semantic
model.
2. Semantic indexing of structured data: Consists of indexing the structured
data using the distributional semantic reference model [2]. The framework
takes as input any dataset following an Entity-Attribute-Value (EAV) format.
DBpedia 3.7 and YAGO are used as the demonstration datasets.
3. Unstructured data entity-centric indexing: This step takes as input a text
collection, recognizes named entities based on the structured data previously
indexed, and aligns the resulting text segments with those indexed entities.
The demonstration uses Wikipedia 2013 as the test collection.

[Figure 1 here: panel (A) depicts the query processing pipeline (Query Interpreter, Dependency Parser, Query Planner, Query Processor, Distributional Search, Entity Search and Disambiguation Operators) over the distributional-compositional (τ-Space) and entity-text indexes, with an example query and its answer and supporting triples; panel (B) depicts DBpedia/YAGO facts and Wikipedia sentences linked to the entity :Alias_(TV_Series).]
Fig. 1: (A) Semantic indexing and query processing architecture. (B) Entity-centric
representation of structured and unstructured data.
The steps are executed by calling one script, which takes as input the three
types of resources (reference corpora, structured datasets and unstructured texts).
After the setup, natural language queries can be executed against the structured
and unstructured data indexes. Figure 1 shows the components of the Treo
architecture (A) and an example of the entity-centric linking between structured
and unstructured data (B).
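The setup script itself is not shown in the paper, so the following is only a hypothetical sketch of a driver that wires the three resources and steps together; the function names, arguments and paths are all assumptions made for illustration.

```python
"""Hypothetical DIY setup driver for a Treo-like index (illustrative sketch)."""

def build_distributional_model(reference_corpus_dir: str):
    """Step 1: build an ESA-style distributional model from a reference corpus."""
    print(f"step 1: building the distributional model from {reference_corpus_dir}")
    return {"model": "esa"}            # placeholder for the concept-vector model

def index_structured_data(eav_dump_path: str, model):
    """Step 2: index Entity-Attribute-Value data (e.g. DBpedia/YAGO triples)
    into the distributional-compositional (tau-Space) index."""
    print(f"step 2: indexing structured data from {eav_dump_path}")
    return {"index": "tau-space"}      # placeholder for the structured index

def index_unstructured_text(text_collection_dir: str, structured_index):
    """Step 3: recognize named entities in the text collection using the
    entities indexed in step 2 and link their sentences to those entities."""
    print(f"step 3: entity-centric indexing of {text_collection_dir}")

if __name__ == "__main__":
    model = build_distributional_model("corpora/wikipedia-2006/")
    structured = index_structured_data("datasets/dbpedia-yago.nt", model)
    index_unstructured_text("texts/wikipedia-2013/", structured)
```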
4 Demonstration
The system is demonstrated over the open-domain DBpedia 3.7/YAGO RDF
datasets and Wikipedia 2013 text data. The RDF datasets consist of 128,071,259
triples (17 GB) loaded into the Treo index for structured data. A set of natural
language queries from the Jeopardy challenge (http://j-archive.com/) and from
the Question Answering over Linked Data challenge (QALD-1,
http://www.sc.cit-ec.uni-bielefeld.de/qald-1, 2011) are used to demonstrate the
system.

[Figure 2 here.]
Fig. 2: Example queries: (1, 2) queries over structured data; (3, 4) Jeopardy queries
over structured and unstructured data.

In the demonstration, users input free natural language queries and the system
returns two
types of results: (i) a list of highly related triples or (ii) post-processed results,
depending on the query type.
Figure 2 (2) shows the output of a query over the structured data index
for the query ‘Was Margaret Thatcher a chemist?’. In addition to the post-
processed answer, which provides a direct (QA-style) answer for the query, the
mechanism shows the justification for the answer with the supporting triples.
Figure 2 (1) shows a query over structured data with a complex query plan
(‘Which cities in New Jersey have more than 10000 inhabitants?’). Figure 2 (3)
and (4) show examples of Jeopardy queries, which typically provide a natural
language description of a named entity or concept (for example: ‘Sydney’s dad,
Jack, was a CIA double agent working against SD-6 on this Jennifer Garner
show’). Further examples can be found online at http://treo.deri.ie/ISWC2013Demo.
Acknowledgments. This work was funded by SFI Ireland (SFI/08/CE/I1380).
References
1. D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI
Magazine, 2010.
2. A. Freitas, E. Curry, J. G. Oliveira, S. O’Riain: A Distributional Structured
Semantic Space for Querying RDF Graph Data. International Journal of Semantic
Computing (IJSC), 2012.