Do It Yourself (DIY) Jeopardy Question Answering System
André Freitas and Edward Curry
Digital Enterprise Research Institute (DERI)
National University of Ireland, Galway
1 Motivation
The evolution and maturity of semantic technologies, techniques and frameworks
are bringing functionalities which were once considered academic or prototypical
into real-life applications. Products such as IBM Watson [1] and Siri are
examples of applications built on state-of-the-art semantic technologies. These
systems synthesize the functionalities which are available for general applications
today, such as natural language search and queries over large-scale data, semantic
flexibility, and integration between structured and unstructured resources. The
success of these projects in demonstrating the potential of existing technologies
lies in the fact that they bring together, in a single system, approaches from
Natural Language Processing (NLP), the Semantic Web (SW), Information
Retrieval (IR) and Databases.
This work demonstrates Treo, a framework which converges elements from
NLP, IR, SW and Databases to create a semantic search engine and question
answering (QA) system for heterogeneous data. Jeopardy and Question Answering
queries over open-domain structured and unstructured data are used to demonstrate
the approach. In this work, Treo is extended to cope with unstructured
text in addition to structured data. The setup of the framework is done in three
steps and can be adapted to other datasets in a simple DIY process.
2 Treo: Querying Structured & Unstructured Data
Treo supports free natural language queries over both structured and unstructured
data. To enable semantic flexibility and vocabulary independence in the
query process, a principled distributional-compositional semantic model is used
to build a distributional structured vector space model (τ-Space) [2]. Distributional
semantics focuses on the automatic construction of a semantic model
based on the statistical distribution of co-occurring words in large-scale corpora.
The distributional semantics component of the model supports a semantic
approximation between query and dataset terms: operations in the τ-Space are
mapped to semantic relatedness operations using the distributional model as a
commonsense knowledge base [2]. The automatic creation of distributional semantic
models supports the transportability of the approach to other datasets
and languages, without requiring the manual effort of creating ontologies (Treo
does not rely on ontology-based reasoning for semantic approximation).
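As an illustration of this kind of semantic approximation, the following is a minimal sketch of distributional relatedness in the style of ESA, assuming a toy corpus and simple term-frequency concept vectors; none of the names or weighting choices below come from the Treo implementation.

```python
import math

# Toy "reference corpus": each entry plays the role of an ESA concept.
# (Illustrative stand-in for the Wikipedia corpus; not the one used by Treo.)
CORPUS = {
    "Chemist": "a chemist studies chemistry compounds and reactions",
    "Politician": "a politician holds public office and wins elections",
    "Margaret Thatcher": "thatcher was a politician and trained as a chemist",
}

def concept_vector(term):
    """Represent a term by its term-frequency weight in each 'concept'."""
    vec = {}
    for concept, text in CORPUS.items():
        tf = text.lower().split().count(term.lower())
        if tf:
            vec[concept] = tf
    return vec

def relatedness(term_a, term_b):
    """Cosine similarity between two concept vectors (semantic approximation)."""
    va, vb = concept_vector(term_a), concept_vector(term_b)
    dot = sum(w * vb.get(c, 0) for c, w in va.items())
    norm = (math.sqrt(sum(w * w for w in va.values()))
            * math.sqrt(sum(w * w for w in vb.values())))
    return dot / norm if norm else 0.0

if __name__ == "__main__":
    print(relatedness("chemist", "thatcher"))   # > 0: related through a shared concept
    print(relatedness("chemist", "elections"))  # 0: no shared concept in the toy corpus
```

A production-scale model would use TF-IDF-weighted vectors over a full Wikipedia dump rather than raw term frequencies, but the geometric intuition is the same.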
In addition to queries over structured data, this work extends the query
mechanism for searching entities in unstructured text. Both structured and
unstructured data are linked in an entity-centric semantic index (Figure 1 (B)),
as illustrated by the sketch below.
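The sketch shows one plausible in-memory shape for such an entity record, with structured facts and text segments attached to the same entity; the field names are illustrative assumptions rather than the actual Treo index layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class EntityRecord:
    """One entry of a hypothetical entity-centric index (illustration only)."""
    uri: str                                                      # e.g. ":Alias_(TV_Series)"
    facts: List[Tuple[str, str]] = field(default_factory=list)   # (attribute, value) pairs
    types: List[str] = field(default_factory=list)               # e.g. YAGO classes
    sentences: List[str] = field(default_factory=list)           # linked text segments

# Example mirroring Figure 1 (B): DBpedia/YAGO facts and Wikipedia sentences
# linked to the same entity.
alias = EntityRecord(
    uri=":Alias_(TV_Series)",
    facts=[(":creator", ":J._J._Abrams"), (":starring", ":Jennifer_Garner")],
    types=[":2001AmericanTelevisionSeriesDebuts"],
    sentences=["It stars Jennifer Garner as Sydney Bristow, a CIA agent."],
)
```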
The elements of the query processing approach are depicted in Figure 1 (A).
Two different query processing strategies are used:
- Query processing over structured data: In the query pre-processing phase,
the natural language query is analyzed by the Interpreter component, where a
set of query triple patterns and features is detected in the user query. The
second phase consists of the vocabulary-independent query processing approach,
which defines a sequence of search and data transformation operations over the
structured data graph embedded in the τ-Space [2], targeting the maximization
of the semantic matching with the query. The Query Planner generates the
sequence of semantic search, navigation and transformation operations over the
graph data, which defines the query processing plan, based on the set of query
features determined in the pre-processing phase. The third phase consists of the
execution of the query processing plan operations over the τ-Space index (see
the first sketch after these two strategies).
- Query processing over structured & unstructured data: In case the
query is not addressed by the available structured data, the query can be processed
against both structured data and unstructured text in the entity-centric
index. The query pre-processing for this query type consists of the detection of
the query focus, through POS-tag-based rules, and of the detection and resolution
of named entities in the query. The query plan consists of the composition of
keyword-search operations over the text segments associated with entities,
distributional search operations over structured data, and keyword search over
associated entities. A ranking function weights the results of all operations,
also taking into account the cardinality of each entity (number of associated
entities, facts and text segments). The initial top-20 entity results are then
re-ranked based on the distributional semantic relatedness scores between the
query focus phrase and the associated entity types (see the second sketch below).
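A rough sketch of the three phases for structured data is given below; the parsing rules, the plan format and the operation names (search_entity, navigate, rank_by_relatedness) are assumptions made for this illustration, not the operators defined in [2].

```python
from typing import Dict, List, Tuple

Plan = List[Tuple[str, str]]  # a query plan as (operation, argument) steps

def interpret(query: str) -> Dict:
    """Pre-processing: detect a triple-like pattern and query features (toy rules)."""
    tokens = query.rstrip("?").split()
    # e.g. "Was Margaret Thatcher a chemist?" -> pivot entity + property to check
    return {"entity": "Margaret Thatcher", "target": tokens[-1], "answer_type": "yes/no"}

def plan(parsed: Dict) -> Plan:
    """Query Planner: choose search/navigation/transformation operations
    based on the features found during pre-processing."""
    return [
        ("search_entity", parsed["entity"]),        # locate the pivot entity in the index
        ("navigate", "outgoing_properties"),        # expand its graph neighbourhood
        ("rank_by_relatedness", parsed["target"]),  # distributional matching step
    ]

class StubIndex:
    """Minimal stand-in for the structured-data index (illustration only)."""
    def run(self, op, arg, current):
        print(f"executing {op}({arg}) over {len(current)} intermediate results")
        return current  # a real index would return matched triples here

def execute(plan_steps: Plan, index) -> List:
    """Execute the query processing plan over the index."""
    results: List = []
    for op, arg in plan_steps:
        results = index.run(op, arg, results)
    return results

if __name__ == "__main__":
    execute(plan(interpret("Was Margaret Thatcher a chemist?")), StubIndex())
```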
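For the combined strategy, the second sketch shows one way the scoring and re-ranking described above could be composed; the 0.5/0.3/0.2 weights and the entity dictionary layout are invented for illustration and are not specified in the paper.

```python
import math

def score_entity(entity, query_terms):
    """Initial score: keyword hits in linked text segments, hits over facts,
    and a cardinality term (weights are arbitrary illustrative values)."""
    text = " ".join(entity["sentences"]).lower()
    text_hits = sum(text.count(t.lower()) for t in query_terms)
    fact_hits = sum(
        any(t.lower() in (attr + " " + val).lower() for t in query_terms)
        for attr, val in entity["facts"]
    )
    cardinality = len(entity["facts"]) + len(entity["sentences"])
    return 0.5 * text_hits + 0.3 * fact_hits + 0.2 * math.log(1 + cardinality)

def rerank_top(entities, query_focus, relatedness, k=20):
    """Re-rank the initial top-k entities by the distributional relatedness
    between the query focus phrase and each entity's types."""
    top = sorted(entities, key=lambda e: e["score"], reverse=True)[:k]
    for e in top:
        e["score"] += max((relatedness(query_focus, t) for t in e["types"]), default=0.0)
    return sorted(top, key=lambda e: e["score"], reverse=True)

if __name__ == "__main__":
    entities = [{
        "uri": ":Alias_(TV_Series)",
        "facts": [(":starring", ":Jennifer_Garner")],
        "types": ["TelevisionSeries"],
        "sentences": ["It stars Jennifer Garner as Sydney Bristow, a CIA agent."],
    }]
    for e in entities:
        e["score"] = score_entity(e, ["Jennifer", "Garner", "CIA"])
    # A real relatedness function (e.g. the ESA-style sketch above) would go here.
    print(rerank_top(entities, "Jennifer Garner show", lambda a, b: 0.0))
```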
3 DIY Setup Process
The setup of the Treo platform for a new dataset consists of the creation of a
semantic index for both structured and unstructured data, which requires three
steps:
1. Construction of the distributional semantic model: Consists of using a
large-scale reference corpus to build the distributional semantic reference
model [2]. In this demonstration, Wikipedia 2006 is used as the reference
corpus and Explicit Semantic Analysis (ESA) is the distributional semantic
model.
2. Semantic indexing of structured data: Consists of indexing the structured
data using the distributional semantic reference model [2]. The framework
takes as input any dataset following an Entity-Attribute-Value (EAV) format.
DBpedia 3.7 and YAGO are used as the demonstration datasets.
3. Unstructured data entity-centric indexing: This step takes as input a text
collection, recognizes named entities based on the structured data previously
indexed, and aligns the resulting text segments with those indexed entities.
The demonstration uses Wikipedia 2013 as the test collection.

[Figure 1 here: panel (A) depicts the query processing pipeline (Query Interpreter, Dependency Parser, Query Planner, Query Processor, Distributional Search, Entity Search and Disambiguation Operators) over the distributional-compositional (τ-Space) and entity-text indexes, with an example query and its answer and supporting triples; panel (B) depicts DBpedia/YAGO facts and Wikipedia sentences linked to the entity :Alias_(TV_Series).]
Fig. 1: (A) Semantic indexing and query processing architecture. (B) Entity-centric
representation of structured and unstructured data.
The steps are executed by calling one script, which takes as input the three
types of resources (reference corpora, structured datasets and unstructured texts).
After the setup, natural language queries can be executed against the structured
and unstructured data indexes. Figure 1 shows the components of the Treo
architecture (A) and an example of the entity-centric linking between structured
and unstructured data (B).
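The setup script itself is not shown in the paper, so the following is only a hypothetical sketch of a driver that wires the three resources and steps together; the function names, arguments and paths are all assumptions made for illustration.

```python
"""Hypothetical DIY setup driver for a Treo-like index (illustrative sketch)."""

def build_distributional_model(reference_corpus_dir: str):
    """Step 1: build an ESA-style distributional model from a reference corpus."""
    print(f"step 1: building the distributional model from {reference_corpus_dir}")
    return {"model": "esa"}            # placeholder for the concept-vector model

def index_structured_data(eav_dump_path: str, model):
    """Step 2: index Entity-Attribute-Value data (e.g. DBpedia/YAGO triples)
    into the distributional-compositional (tau-Space) index."""
    print(f"step 2: indexing structured data from {eav_dump_path}")
    return {"index": "tau-space"}      # placeholder for the structured index

def index_unstructured_text(text_collection_dir: str, structured_index):
    """Step 3: recognize named entities in the text collection using the
    entities indexed in step 2 and link their sentences to those entities."""
    print(f"step 3: entity-centric indexing of {text_collection_dir}")

if __name__ == "__main__":
    model = build_distributional_model("corpora/wikipedia-2006/")
    structured = index_structured_data("datasets/dbpedia-yago.nt", model)
    index_unstructured_text("texts/wikipedia-2013/", structured)
```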
4 Demonstration
The system is demonstrated over the open-domain DBpedia 3.7/YAGO RDF
datasets and Wikipedia 2013 text data. The RDF datasets consist of 128,071,259
triples (17 GB) loaded into the Treo index for structured data. A set of natural
language queries from the Jeopardy challenge (http://j-archive.com/) and from
the Question Answering over Linked Data challenge (QALD-1,
http://www.sc.cit-ec.uni-bielefeld.de/qald-1, 2011) are used to demonstrate the
system.

[Figure 2 here.]
Fig. 2: Example queries: (1, 2) queries over structured data; (3, 4) Jeopardy queries
over structured and unstructured data.

In the demonstration, users input free natural language queries and the system
returns two
types of results: (i) a list of highly related triples or (ii) post-processed results,
depending on the query type.
Figure 2 (2) shows the output of a query over the structured data index
for the query ‘Was Margaret Thatcher a chemist?’. In addition to the post-
processed answer, which provides a direct (QA-style) answer for the query, the
mechanism shows the justification for the answer with the supporting triples.
Figure 2 (1) shows a query over structured data with a complex query plan
(‘Which cities in New Jersey have more than 10000 inhabitants?’). Figure 2 (3)
and (4) show examples of Jeopardy queries, which typically provide a natural
language description of a named entity or concept (for example: ‘Sydney’s dad,
Jack, was a CIA double agent working against SD-6 on this Jennifer Garner
show’). Further examples can be found online at http://treo.deri.ie/ISWC2013Demo.
Acknowledgments. This work was funded by SFI Ireland (SFI/08/CE/I1380).
References
1. D. Ferrucci et al.: Building Watson: An Overview of the DeepQA Project. AI
Magazine, 2010.
2. A. Freitas, E. Curry, J. G. Oliveira, S. O’Riain: A Distributional Structured
Semantic Space for Querying RDF Graph Data. International Journal of Semantic
Computing (IJSC), 2012.