Conference Paper

Web-assisted annotation, semantic indexing and search of television and radio news

Authors:
  • Mike Dowman
  • Valentin Tablan
  • Hamish Cunningham
  • Borislav Popov

Abstract

The Rich News system, which can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web, is described. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web news sites gives the opposite: conceptual accuracy but no temporal data. Our approach combines the two for temporally accurate conceptual semantic annotation of broadcast news. First, low-quality transcripts of the broadcasts are produced using speech recognition, and these are then automatically divided into sections corresponding to individual news stories. A key phrase extraction component finds key phrases for each story and uses these to search for web pages reporting the same event. The text and meta-data of the web pages are then used to create index documents for the stories in the original broadcasts, which are semantically annotated using the KIM knowledge management platform. A web interface then allows conceptual search and browsing of news stories, and playing of the parts of the media files corresponding to each news story. The use of material from the World Wide Web allows much higher quality textual descriptions and semantic annotations to be produced than would have been possible using the ASR transcript directly. The semantic annotations can form a part of the Semantic Web, and an evaluation shows that the system operates with high precision and a moderate level of recall.
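A toy sketch of the pipeline described in the abstract, with simplified stand-ins for each stage (ASR transcript, story segmentation, key phrase extraction, web lookup, index document). All names here are hypothetical illustrations, not the actual Rich News components:

```python
"""Toy sketch (not the authors' code) of the Rich News pipeline stages:
ASR transcript -> story segmentation -> key phrases -> web lookup -> index document."""

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class IndexDocument:
    start: float
    end: float
    key_phrases: list
    web_url: str = ""
    annotations: list = field(default_factory=list)  # would come from KIM in the real system

def segment_into_stories(transcript_words, window=50):
    """Naive stand-in for topical segmentation (Rich News uses Choi's C99)."""
    return [transcript_words[i:i + window] for i in range(0, len(transcript_words), window)]

def extract_key_phrases(words, n=3):
    """Frequency-based stand-in for the TF.IDF key phrase extractor."""
    counts = Counter(w.lower() for w in words if len(w) > 4)
    return [w for w, _ in counts.most_common(n)]

def search_news_sites(phrases):
    """Placeholder: a real system would query news websites with the phrases."""
    return "https://example.org/news?q=" + "+".join(phrases)

def annotate_broadcast(transcript_words, words_per_second=2.5, window=50):
    docs = []
    for i, story in enumerate(segment_into_stories(transcript_words, window)):
        phrases = extract_key_phrases(story)
        start_word = i * window
        docs.append(IndexDocument(
            start=start_word / words_per_second,              # rough time offsets
            end=(start_word + len(story)) / words_per_second,
            key_phrases=phrases,
            web_url=search_news_sites(phrases),
        ))
    return docs
```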


... News users: More than half the main papers aim to offer news services to the general public. An early example is Rich News [11], a system that automatically transcribes and segments radio and TV streams. Key phrases extracted from each segment are used to retrieve web pages that report the same news event. ...
... news articles (59), news feeds (23), RSS feeds (17), KG (11), social media (10), multimedia (8), Twitter (6), TV news (4), user histories (4), news metadata (3) Life-cycle phase: ...
... • Semantic exchange formats: RDF (43), OWL (28), SPARQL (25), KG (18), RDFS (12)
• Semantic ontologies and vocabularies: FOAF (6), ESO (2), GAF (2)
• Semantic information resources: domain ontology (31), DBpedia (23), LOD (14), Freebase (9), Wikidata (9), GeoNames (7), Google KG (3), YAGO (3), OpenCyc (2), ConceptNet (2)
• Semantic processing techniques: entity linking (32), Jena (12), reasoning (7), inference (6), DBpedia Spotlight (5), OpenCalais (4), description logic (4), PropBank (2), FrameNet (2)
• Other processing techniques (language): entity extraction (36), NL pre-processing (33), coreference resolution (11), GATE (10), Lucene (7), spaCy (7), JAPE (6), morphological analysis (6)
Table 6 shows the conceptual framework that results from populating our analysis framework in Table 1 with the most frequently used sub-themes from the analysis. It is organised in a hierarchy of depth up to 4 (e.g., Other techniques → Other resources → language → WordNet). ...
Article
Full-text available
ICT platforms for news production, distribution, and consumption must exploit the ever-growing availability of digital data. These data originate from different sources and in different formats; they arrive at different velocities and in different volumes. Semantic knowledge graphs (KGs) are an established technique for integrating such heterogeneous information. The technique is therefore well-aligned with the needs of news producers and distributors, and it is likely to become increasingly important for the news industry. This paper reviews the research on using semantic knowledge graphs for production, distribution, and consumption of news. The purpose is to present an overview of the field; to investigate what it means; and to suggest opportunities and needs for further research and development.
... Due to this diverse structure and content the challenge here is how to choose and customise the ontology learning methods, so that they can achieve the best possible results with minimum human intervention. Another aspect that is worth considering here is whether some knowledge is easier to acquire from only some of these sources (e.g., key terms from the source code comments), and then combine this newly acquired knowledge with information from the other sources (for an application of this approach in multimedia indexing see [11]). Static vs dynamic: As software tends to go through versions or releases, i.e., evolve over time, the majority of software-related datasources tend to change over time, albeit some more frequently than others. ...
... Our multi-source ontology learning system uses the language processing facilities provided by GATE itself [8, 3, 11] and we have modified or extended some of them specifically for the problem of learning from software artifacts. Note that GATE plays a dual role in our research – both as one of the software projects used for experimenting with our technology and also as the language processing software infrastructure, which we used for building the technology itself. ...
... The goal is to derive the same term from the singular and plural forms, instead of two different terms. The third component is the GATE key phrase extractor [11], which is based on TF.IDF (term frequency/inverse document frequency). This method looks for phrases that occur more frequently in the text under consideration than they do in language as a whole. ...
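A minimal TF.IDF scoring sketch, not the GATE implementation, illustrating why phrases frequent in the story but rare in a background collection surface as key phrases:

```python
import math
from collections import Counter

def tfidf_scores(story_tokens, background_docs):
    """Score each term by its frequency in the story times its inverse document
    frequency over a background collection; high scores suggest key phrases."""
    tf = Counter(story_tokens)
    n_docs = len(background_docs)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in background_docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF
        scores[term] = (freq / len(story_tokens)) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: the top-scoring terms would then be used as web search queries.
story = "the minister announced new rail funding for the north".split()
background = [set("the weather will be sunny".split()),
              set("the new film opens this week".split())]
print(tfidf_scores(story, background)[:3])
```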
Article
Full-text available
While early efforts on applying Semantic Web technologies to solve software engineering related problems show promising results, the very basic process of augmenting software artifacts with their semantic representations is still an open issue. Indeed, existing techniques to learn ontologies that describe the domain of a certain software project either 1) explore only one information source associated with this project or 2) employ supervised and domain-specific techniques. In this paper we present an ontology learning approach that 1) exploits a range of information sources associated with software projects and 2) relies on techniques that are portable across application domains.
... Likewise, news comes from newspaper publishers in either soft copy or hard copy. With the web, soft-copy news spreads easily to every corner of society; distributed through online media, it is quickly absorbed and subsequently affects social life [15]. The web allows the presentation of information that can be read, heard, and seen in parallel. ...
... In line with information needs, studies on the web have progressed to the point that the web is becoming a collection of smart documents, namely "documents that know about themselves" [18]. The web has become the core of the internet; therefore, studies continue to be done to improve the capability of the web as a place for information, with web facilities coupled with the implementation of the standards of the W3C (World Wide Web Consortium), an organization that assesses and standardizes the attributes for structuring information [15]. ...
Article
Full-text available
In the social world there are many issues, positive or negative. The negative issues affect the level of social comfort. On social media such as the Web, every issue is positioned based on a document, which has its own attributes, such as its URL address and date of creation. It is not easy to extract information from the Web, or to determine the origin of an issue that is flowing through the web. This paper derives a method for revealing the origin of an issue based on the characteristics of each webpage.
... The THISL system [1] applies an automated speech recognition system (ABBOT) on BBC news broadcasts and uses a bag-of-words model on the resulting transcripts for programme retrieval. The Rich News system [5] also uses ABBOT for speech recognition. It then segments the transcripts with Choi's C99 algorithm [3], which uses bag-of-words similarity between consecutive segments. ...
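A simplified sketch of lexical-cohesion segmentation in the spirit of this approach (not an implementation of C99): candidate boundaries are placed where the cosine similarity between adjacent bag-of-words windows drops below a threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def boundary_candidates(tokens, window=40, threshold=0.1):
    """Compare bag-of-words windows on either side of each candidate point and
    report points where lexical cohesion falls below the threshold."""
    boundaries = []
    for i in range(window, len(tokens) - window, window):
        left = Counter(tokens[i - window:i])
        right = Counter(tokens[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# Two artificial "stories": the boundary is detected where the vocabulary shifts.
tokens = ("rail funding north announcement " * 30 + "football cup final result " * 30).split()
print(boundary_candidates(tokens))   # -> [120]
```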
... The further away a common ancestor between two categories is, the lower the cosine similarity between those two categories will be. We implemented such a vector space model within our RDFSim project. We consider a vector in that space for each DBpedia web identifier, corresponding to a weighted sum of all the categories attached to it. ...
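A toy version of the category vector space described above (the category sets and weights are invented for illustration; this is not the RDFSim code):

```python
import math

def category_cosine(a, b):
    """Cosine similarity between two weighted bags of DBpedia categories."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical weighted category vectors for two DBpedia identifiers; in the
# described model, weights could decay with distance up the category hierarchy.
vec_bbc = {"Category:Television_in_the_UK": 1.0, "Category:Broadcasting": 0.5}
vec_itv = {"Category:Television_in_the_UK": 1.0, "Category:Companies": 0.5}
print(category_cosine(vec_bbc, vec_itv))   # shared ancestor categories raise the similarity
```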
Article
Full-text available
The BBC is currently tagging programmes manually, using DBpedia as a source of tag identifiers, and a list of suggested tags extracted from the programme synopsis. These tags are then used to help navigation and topic-based search of programmes on the BBC website. However, given the very large number of programmes available in the archive, most of them having very little metadata attached to them, we need a way to automatically assign tags to programmes. We describe a framework to do so, using speech recognition, text processing and concept tagging techniques. We describe how this framework was successfully applied to a very large BBC radio archive. We demonstrate an application using automatically extracted tags to aid discovery of archive content.
... GAMPs falling in this class are language-specific, so two SA_GAMPs have been designed, for Italian and English information extraction respectively. In the following, the Italian SA_GAMP will be used as the reference example for the discussion, while technical details of the English one are found in [17]. ...
... Finally, a module using an ontology to annotate the news item is applied (in PrestoSpace, the KIM platform [13] has been used). More details on the Italian SA_GAMP can be found in [11], while the English SA_GAMP is discussed in detail in [17]. KIM will be further discussed in section 3.2. ...
Conference Paper
Full-text available
This paper will present the contribution of the European PrestoSpace project to the study and development of a Metadata Access and Delivery (MAD) platform for multimedia and television broadcast archives. The MAD system aims at generating, validating and delivering to archive users metadata created by automatic and semi-automatic information extraction processes. The MAD publication platform employs audiovisual content analysis, speech recognition (ASR) and semantic analysis tools. It then provides intelligent facilities to access the imported and newly produced metadata. The possibilities opened by the PrestoSpace framework to intelligent indexing and retrieval of multimedia objects within large scale archives apply as well to more general scenarios where semantic information is needed to cope with the complexity of the search process.
... In [MBS08] Messina et al. describe tools for content-based analysis, e.g., scene-cut detection, speech-to-text transcription and keyframe extraction. Moreover, named entities and categories from the BBC's program web-pages were also extracted and mapped to the PROTON upper ontology [DTCP05]. Mediaglobe complements PrestoSpace's efforts, as high-level abstraction-layer analysis technologies, i.e. visual concept detection, were not in the scope of PrestoSpace. ...
... Therefore, background knowledge has to be aggregated that is maintained independently of the original source, i.e., the film archive. In contrast to [DTCP05, MBD06] we access DBpedia as a multilingual knowledge base to allow cross-lingual search and annotation with entities that are steadily curated by the Wikipedia community. Europeana aggregates video data, including supporting documents, from various film archives in Europe. ...
... Rich News is essentially an application-independent annotation system, there being many potential uses for the semantic annotations it produces. The first use proposed for the Rich News system was to automate the annotation of BBC news programmes [9]. For more than twenty years the BBC have been semantically annotating their news in terms of a taxonomy called Lonclass, which was derived from the Universal Decimal Classification system commonly used by libraries to classify books [3]. ...
... If this page matches sufficiently closely, it is associated with the story, and used to derive title, summary and section annotations. An evaluation of an earlier version of Rich News [9] found that 92.6% of the web pages found in this way reported the same story as that in the broadcast, while the remaining ones reported closely related stories. That version of Rich News searched only the BBC website, and was successful in finding web pages for 40% of the stories, but the addition of multiple news sources, and an improved document matching component, can be expected to have raised both the precision and recall of the system, though no formal evaluation of the current system has been conducted. ...
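A minimal sketch of the document-matching step (hypothetical scoring, not the Rich News matcher): a candidate web page is accepted when enough of the story's key phrases occur in it, and a matching page then supplies the title, summary and section annotations.

```python
def page_matches_story(story_phrases, page_text, min_overlap=0.5):
    """Hypothetical closeness test: accept a candidate web page if a large enough
    fraction of the story's key phrases appear in the page text."""
    page = page_text.lower()
    hits = sum(1 for phrase in story_phrases if phrase.lower() in page)
    return hits / len(story_phrases) >= min_overlap if story_phrases else False

print(page_matches_story(["rail funding", "transport minister"],
                         "Transport minister announces rail funding boost"))  # True
```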
Article
The Rich News system for semantically annotating television news broadcasts and augmenting them with additional web content is described. Online news sources were mined for material reporting the same stories as those found in television broadcasts, and the text of these pages was semantically annotated using the KIM knowledge management platform. This resulted in more effective indexing than would have been possible if the programme transcript was indexed directly, owing to the poor quality of transcripts produced using automatic speech recognition. In addition, the associations produced between web pages and television broadcasts enable the automatic creation of augmented interactive television broadcasts and multimedia websites.
... In addition to enabling catching-up users to see live-tweets and speculations posted by other users who watched the episode in sync with the broadcast, we can design and develop interactions that only catching-up users can appreciate. Inspired by the recent proposals for story-based retrievals from TV shows [57,71] and comics [37,42] or annotations [16,49], we suggest that it may be helpful to guide catching-up users to be attentive by showing them a subtle alert during scenes about which others have frequently speculated. This can be identified through an analysis of speculation-tweets. ...
Article
Full-text available
A growing number of people are using catch-up TV services rather than watching simultaneously with other audience members at the time of broadcast. However, computational support for such catching-up users has not been well explored. In particular, we are observing an emerging phenomenon in online media consumption experiences in which speculation plays a vital role. As the phenomenon of speculation implicitly assumes simultaneity in media consumption, there is a gap for catching-up users, who cannot directly appreciate the consumption experiences. This conversely suggests that there is potential for computational support to enhance the consumption experiences of catching-up users. Accordingly, we conducted a series of studies to pave the way for developing computational support for catching-up users. First, we conducted semi-structured interviews to understand how people are engaging with speculation during media consumption. As a result, we discovered the distinctive aspects of speculation-based consumption experiences in contrast to social viewing experiences sharing immediate reactions that have been discussed in previous studies. We then designed two prototypes for supporting catching-up users based on our quantitative analysis of Twitter data in regard to reaction- and speculation-based media consumption. Lastly, we evaluated the prototypes in a user experiment and, based on its results, discussed ways to empower catching-up users with computational supports in response to recent transformations in media consumption.
... This system also utilizes an ASR tool to obtain the video texts and IE techniques (named entity recognition). Another semantic video annotation application called Rich News has been described in [8], where the authors make use of the resources on the web to enhance the indexing process. The overall system contains the following modules: automatic speech recognition, key-phrase extraction from the speech transcripts and searching the video using key phrases. ...
... Several attempts have been made, such as the THISL system [8], which used ABBOT for automatic speech recognition on BBC news broadcasts, and the Rich News system [9], which also used ABBOT for speech recognition. The transcripts were then segmented with Choi's C99 algorithm [10], using bag-of-words matching between consecutive segments. ...
Conference Paper
Many organizations have attempted to automatically exploit the data embedded in web pages and enrich the web with a semantic dimension. Data should follow the principles outlined by Tim Berners-Lee, which are based on traditional web technologies, such as the Uniform Resource Identifier (URI) and the Hypertext Transfer Protocol (HTTP), and semantic web technologies, including knowledge representation languages such as the Resource Description Framework (RDF), as well as links to other data. The uses of linked data technology are numerous and varied. The case of the British Broadcasting Corporation (BBC) is the most widely reported success story of linked data technology usage in the literature. This success stems from the use of linked data in the BBC web portal, which enables the site to present rich content that is automatically updated from the linked data cloud. The aim of this study is to analyze the literature relating to this case study to derive approaches and technologies for linked data usage and propose a group of best practices, as well as a generic approach that can be used by web developers.
... In Section 5 we present two methods for acquiring annotations, obtained from two main sources of annotators: linguists, and non-experts. For the expert linguists, we have developed a wiki-like platform from scratch, because existing annotation systems (e.g., GATE [26], NITE [18], or UIMA [31]) do not offer the functionalities required for deep semantic annotation. For the non-experts, we introduce a crowd-sourcing method based on gamification. ...
Chapter
Full-text available
The goal of the Groningen Meaning Bank (GMB) is to obtain a large corpus of English texts annotated with formal meaning representations. Since manually annotating a comprehensive corpus with deep semantic representations is a hard and time-consuming task, we employ a sophisticated bootstrapping approach. This method employs existing language technology tools (for segmentation, part-of-speech tagging, named entity tagging, animacy labelling, syntactic parsing, and semantic processing) to get a reasonable approximation of the target annotations as a starting point. The machine-generated annotations are then refined by information obtained from both expert linguists (using a wiki-like platform) and crowd-sourcing methods (in the form of a 'Game with a Purpose') which help us in deciding how to resolve syntactic and semantic ambiguities. The result is a semantic resource that integrates various linguistic phenomena, including predicate-argument structure, scope, tense, thematic roles, rhetorical relations and presuppositions. The semantic formalism that brings all levels of annotation together in one meaning representation is Discourse Representation Theory, which supports meaning representations that can be translated to first-order logic. In contrast to ordinary treebanks, the units of annotation in the GMB are texts, rather than isolated sentences. The current version of the GMB contains more than 10,000 public domain texts aligned with Discourse Representation Structures, and is freely available for research purposes.
... Named entity recognition (NER) is a task in information extraction (IE) which consists of identifying and classifying just some types of information elements, called named entities (NE) (Marrero et al. 2013). It is employed as the basis for many other important areas in information management, such as information retrieval (Mihalcea and Moldovan 2001), automatic summarization (Lee et al. 2003) and semantic multimedia annotation (Dowman et al. 2005; Saggion et al. 2004) in some domains such as biomedical texts (Tsai et al. 2006), business information documents (Sung and Chang 2004), and financial documents (Seng and Lai 2010). ...
Article
Full-text available
Named entity recognition (NER) is an information extraction subtask that attempts to recognize and categorize named entities in unstructured text into predefined categories such as the names of people, organizations, and locations. Recently, machine learning approaches, such as the hidden Markov model (HMM), as well as hybrid methods, are frequently used for named entity recognition. To the best of our knowledge, there are no publicly available data sets for NER in Persian, nor any machine learning-based Persian NER system. Because of HMM's innate weaknesses, in this paper we have used both a hidden Markov model and a rule-based method to recognize named entities in Persian texts. The combination of the rule-based method and the machine learning method results in highly accurate recognition. The proposed system uses the HMM and Viterbi algorithms in its machine learning section, and in its rule-based section employs a set of lexical resources and pattern bases for the recognition of named entities, including the names of people, locations and organizations. During this study, we annotated our own training and testing data sets for use in the related phases. Our hybrid approach performs on the Persian language with 89.73% precision, 82.44% recall, and 85.93% F-measure using an annotated test corpus including 32,606 tokens.
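A minimal illustration of the rule-based half of such a hybrid NER system (gazetteer lookups plus surface patterns); the gazetteer entries, the pattern, and the English example text are toy assumptions chosen for readability, and the statistical HMM half is only indicated by a comment:

```python
import re

# Toy hybrid idea: gazetteer and pattern rules handle clear-cut entities, while
# a statistical tagger (e.g. an HMM with Viterbi decoding) would label the rest (not shown).
GAZETTEER = {"tehran": "LOCATION", "isfahan": "LOCATION"}
PERSON_PATTERN = re.compile(r"\b(?:Mr|Dr|Mrs)\.?\s+([A-Z][a-z]+)")

def rule_based_entities(text):
    entities = [(m.group(1), "PERSON") for m in PERSON_PATTERN.finditer(text)]
    for token in re.findall(r"\w+", text):
        label = GAZETTEER.get(token.lower())
        if label:
            entities.append((token, label))
    return entities

print(rule_based_entities("Dr. Karimi travelled from Tehran to Isfahan."))
```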
... This was traditionally accomplished through the use of metadata, but has been replaced with semantic annotations based on domain ontologies (Berners-Lee et al., 2001). The advantages of such annotations are quicker search and retrieval of documents, the automation of several web-based activities, etc. (Gardenfors, 2004; Frienland et al., 2004; Dowman et al., 2005; Rinaldi et al., 2004; Plessers et al., 2005; Maynard et al., 2004; Hunter et al., 2004). Different methods have been employed to annotate knowledge assets; these are comprehensively tackled in Uren et al. (2005). ...
Article
Full-text available
Insider attack and espionage on computer-based information is a major problem for business organizations and governments. Knowledge Management Systems (KMSs) are not exempt from this threat. Prior research presented the Congenial Access Control Model (CAC), a relationship-based access control model, as a better access control method for KMS because it reduces the adverse effect of stringent security measures on the usability of KMSs. However, the CAC model, like other models, e.g., Role Based Access Control (RBAC), Time-Based Access Control (TBAC), and History Based Access Control (HBAC), does not provide adequate protection against privilege abuse by authorized users that can lead to industrial espionage. In this paper, the authors provide an Espionage Prevention Model (EP) that uses Semantic web-based annotations on knowledge assets to store relevant information and compares it to the Friend-Of-A-Friend (FOAF) data of the potential recipient of the resource. It can serve as an additional layer to previous access control models, preferably the Congenial Access Control (CAC) model.
... Named Entity Recognition (NER) is a task in Information Extraction (IE) consisting of identifying and classifying just some types of information elements, called Named Entities (NE) [1]. It is employed as the basis for many other important areas in Information Management, such as information retrieval [2], automatic summarization [3] and semantic multimedia annotation [4, 5] in some domains such as biomedical texts [6], business information documents [7], and financial documents [8]. Named entity recognition is one of the main and important information extraction subtasks and is defined as the recognition of names of people, locations, organizations as well as temporal and numeric expressions [9]. ...
Conference Paper
Full-text available
Named Entity Recognition (NER) is an information extraction subtask that attempts to recognize and categorize named entities in unstructured text into predefined categories such as the names of people, organizations, and locations. Recently, machine learning approaches, such as the Hidden Markov Model (HMM), as well as hybrid methods, are frequently used for named entity recognition. To the best of our knowledge, there are no publicly available data sets for NER in Persian, nor any machine learning-based Persian NER system. Because of HMM's innate weaknesses, in this paper we have used both a Hidden Markov Model and a rule-based method to recognize named entities in Persian texts. The combination of the rule-based method and the machine learning method results in highly accurate recognition. The proposed system, in its machine learning section, uses the HMM and Viterbi algorithms, and in its rule-based section employs a set of lexical resources and pattern bases for the recognition of named entities including the names of people, locations and organizations. During this study, we annotated our own training and testing data sets for use in the related phases. Our hybrid approach performs on the Persian language with 89.73% precision, 82.44% recall, and 85.93% F-measure using an annotated test corpus including 32,606 tokens.
... The use of knowledge embodied in annotation is being investigated in domains as diverse as scientific knowledge [10], radio and television news [11], genomics [12], making web pages accessible to visually impaired people [13] and the description of cultural artifacts in museums [14]. ...
Conference Paper
Full-text available
Every day thousands of news articles are published in Bangla from several different sources on the web, and this number is increasing rapidly. On the contrary, readers are often selective, reading only their desired news. In this connection, classical Information Extraction (IE) techniques are used to query with keywords over unstructured or semi-structured news content, fulfilling only partial requirements. However, they cannot interpret sequences of events or relations among entities, or infer unveiled facts to facilitate further human analysis. To achieve this goal, semantic technology adds formal structure and semantics to the news stream. In this paper, we propose a system that analyzes Bangla news content and automatically annotates things, people and places with semantic technology, extracting what happened, when, where, and who was involved, with the help of classical Natural Language Processing (NLP) techniques. Furthermore, we relate the news of today with previous news to accumulate information over time. We present our proposed system for semantically annotating Bangla news, experiment with SPARQL to infer integrated news from different sources over time, and show its effectiveness in querying specific information.
... et al. [13,27] evaluate the precision and recall of annotation types (elements in our second-level grammar) rather than actual results of semantic search. In subsequent research, a search evaluation on television and radio news articles was conducted in [28] using KIM, based on ontology and keyword-based query interpretation, not the rules developed in the system. ...
Article
While contemporary semantic search systems offer to improve classical keyword-based search, they are not always adequate for complex domain specific information needs. The domain of prescription drug abuse, for example, requires knowledge of both ontological concepts and “intelligible constructs” not typically modeled in ontologies. These intelligible constructs convey essential information that includes notions of intensity, frequency, interval, dosage and sentiments, which could be important to the holistic needs of the information seeker. In this paper, we present a hybrid approach to domain specific information retrieval (or knowledge-aware search system) that integrates ontology-driven query interpretation with synonym-based query expansion and domain specific rules, to facilitate search in social media. Our framework is based on a context-free grammar (CFG) that defines the query language of constructs interpretable by the search system. The grammar provides two levels of semantic interpretation: 1) a top-level CFG that facilitates retrieval of diverse textual patterns, which belong to broad templates and 2) a low-level CFG that enables interpretation of certain specific expressions that belong to such patterns. These low-level expressions occur as concepts from four different categories of data: 1) ontological concepts, 2) concepts in lexicons (such as emotions and sentiments), 3) concepts in lexicons with only partial ontology representation, called lexico-ontology concepts (such as side effects and routes of administration (ROA)), and 4) domain specific expressions (such as date, time, interval, frequency and dosage) derived solely through rules. Our approach is embodied in a novel Semantic Web platform called PREDOSE, which provides search support for complex domain specific information needs in prescription drug abuse epidemiology. When applied to a corpus of over 1 million drug abuse-related web forum posts, our search framework proved effective in retrieving relevant documents when compared with three existing search systems.
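As an illustration of a two-level query grammar of the kind described (not the actual PREDOSE grammar; the non-terminals and terminal words below are invented for the example), a toy context-free grammar can be defined and parsed with NLTK:

```python
import nltk

# Toy query grammar: a top-level QUERY template composed of low-level
# categories (drug term, event, dosage expression). Illustrative only.
grammar = nltk.CFG.fromstring("""
  QUERY  -> DRUG EVENT | DRUG EVENT DOSAGE
  DRUG   -> 'loperamide' | 'buprenorphine'
  EVENT  -> 'withdrawal' | 'overdose'
  DOSAGE -> NUM UNIT
  NUM    -> '2' | '10'
  UNIT   -> 'mg'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['buprenorphine', 'withdrawal', '2', 'mg']):
    print(tree)   # the parse tree makes the query's structure explicit
```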
... The famous TF-IDF concept is used for selecting keywords [10]. This method looks for words that occur frequently in the text under consideration, but discounts words that occur frequently in the text collection as a whole. ...
Article
Recently, digital systems such as networked production systems and video archive systems have been under construction inside broadcasting stations for efficient multimedia retrieval and management. While the importance of metadata for multimedia retrieval services cannot be overemphasized, it is very difficult to generate semantic metadata (e.g. titles, keywords, characters' names, etc.) that is useful in the broadcasting field through pure audio-visual signal processing. The goal of our project is to develop a technology for generating semantic metadata from broadcast content using speech/text/face recognition methods for practical usage. Speech and text recognition engines optimised for news programmes have been developed and integrated into a data summarizing module together with the face recognition engine (introduced at IBC 2004). The recognition results extracted from these engines are merged and summarized based on word importance and the TF-IDF concept to generate semantic metadata for each news scene. To evaluate the performance of the developed engines, the Automatic Metadata Generator software, OMEGA, has been implemented. We have experimented with it on a commercial MAM system to show the usefulness of our approach, which is based on recognition and data summarization technology.
... – Obtain links to resources by using the services selected in the previous stage. To accomplish this objective it will be necessary to define a search strategy, such as the creation of automatic queries, searching lists or ontologies, consulting online encyclopedias, etc. The work of Janevski and Dimitrova [Janevski, 2002] and of Dowman et al. [Dowman, 2005] on video enrichment are interesting examples of how such information extraction can be achieved. ...
Article
Full-text available
Abstract: We present a project currently under development whose objective is the creation of a text-enrichment model based on the integration of resources available in the web space. The proposed model aims to transform linear plain texts into hypertexts that provide information and multimedia resources about recognized entities. With this application, users will be able to transform texts into self-explanatory hypertexts that give them a deeper understanding and save them from carrying out individual searches for related information. The evolution towards the web 2.0 concept and the proliferation and popularization of alternative search engines, blogs, wikis, tagging services, question/answering services, etc. are ideal for efficiently exploiting the resources that the Internet provides and using them strategically for text enrichment. Keywords: hypertext, entity recognition, text enrichment, content augmentation.
... While there have been attempts to apply semantic annotation tools to multimedia data (e.g. news videos [DTCP05]), the approaches tend to be domain and application-specific and thus need to be developed further prior to being applied to software artefacts, such as screen shots, training videos, and software specifications. ...
Article
EU-IST Strategic Targeted Research Project (STREP) IST-2004-026460 TAO, Deliverable D3.1 (WP3). This deliverable is concerned with developing algorithms and tools for semantic annotation of legacy software artefacts, with respect to a given domain ontology. In the case of non-textual content, e.g., screen shots and design diagrams, we have applied OCR software prior to Information Extraction. The results have been made available as a web service, which is in the process of being refined and integrated within the TAO Suite.
... Focusing on these problems, many works, e.g. [1][2][3][4][5][6][7][8][9], present search methods. All the suggested methods, for example similarity search, latent semantic indexing and conceptual word-chains, make considerable progress relative to the original keyword-matching mode. ...
Article
Full-text available
In this paper, an overall framework for a paper retrieval system based on papers' connotation is proposed. The paper database is sorted into four ranks. Each of them is mainly described by an extended keyword set, which serves as the carrier of the precise connotation of a paper. Based on the matching degree between the vocabulary of a paper's introduction and the extended keyword set of the topics, papers in those topics are selected by using fuzzy rules. Thus, papers whose connotation approximates the user's interest can be obtained. Furthermore, an automatic method to identify new and hot topics is presented.
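A rough sketch of the matching-degree idea under stated assumptions: the "extended keyword set" is represented here as a plain list of strings, and the fuzzy rules are approximated by two fixed thresholds.

```python
def matching_degree(intro_tokens, extended_keywords):
    """Toy matching degree: fraction of a topic's extended keyword set that
    appears in the paper introduction (a stand-in for the paper's fuzzy rules)."""
    intro = set(t.lower() for t in intro_tokens)
    keys = set(k.lower() for k in extended_keywords)
    return len(intro & keys) / len(keys) if keys else 0.0

def select_papers(papers, topic_keywords, low=0.3, high=0.6):
    """Threshold-based selection: clearly relevant above `high`, borderline
    between `low` and `high`, rejected below `low`."""
    relevant, borderline = [], []
    for title, intro in papers:
        degree = matching_degree(intro.split(), topic_keywords)
        if degree >= high:
            relevant.append((title, degree))
        elif degree >= low:
            borderline.append((title, degree))
    return relevant, borderline

papers = [("Paper A", "latent semantic indexing for paper retrieval"),
          ("Paper B", "a study of bird migration")]
print(select_papers(papers, ["semantic", "indexing", "retrieval"]))
```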
... We have developed the wiki-like platform from scratch simply because existing annotation systems, such as GATE (Dowman et al., 2005), NITE (Carletta et al., 2003), or UIMA (Hahn et al., 2007), do not offer the functionality required for deep semantic annotation combined with crowdsourcing. ...
Conference Paper
Full-text available
Data-driven approaches in computational semantics are not common because there are only few semantically annotated resources available. We are building a large corpus of public-domain English texts and annotate them semi-automatically with syntactic structures (derivations in Combinatory Categorial Grammar) and semantic representations (Discourse Representation Structures), including events, thematic roles, named entities, anaphora, scope, and rhetorical structure. We have created a wiki-like Web-based platform on which a crowd of expert annotators (i.e. linguists) can log in and adjust linguistic analyses in real time, at various levels of analysis, such as boundaries (tokens, sentences) and tags (part of speech, lexical categories). The demo will illustrate the different features of the platform, including navigation, visualization and editing.
... Moreover, in television, semantic annotation of programmes, for example news, could produce electronic programme guides, which would allow the user to view details of forthcoming programmes in terms of entities referred to in particular broadcasts [Dowman et al., 2005]. ...
Article
Full-text available
The paper presents an ontological approach for enabling a semantic-aware video retrieval framework that facilitates user access to desired content. Through the ontologies, the system will express key entities and relationships describing videos in a formal machine-processable representation. An ontology-based knowledge representation could be used for content analysis and concept recognition, for reasoning processes and for enabling user-friendly and intelligent multimedia content retrieval.
... A number of previous approaches have been taken to the problem of segmentation of text and speech transcripts. Some of these approaches have been based only on differences in the distribution of words in parts of the text dealing with different topics (Hearst, 1994; Choi, 2000; Kan et al., 1998; Dowman et al., 2005), while others have focused on features that are indicative of topic boundaries (Franz et al., 2003; Mulbregt et al., 1998; Kehagias et al., 2004; Maskey and Hirschberg, 2003). Generally, the greatest success has been achieved by combining both kinds of cues into a single system (Chaisorn et al., 2003; Beeferman et al., 1999; Galley et al., 2003). ...
Article
Full-text available
In order to determine the points at which meeting discourse changes from one topic to another, probabilistic models were used to approximate the process through which meeting transcripts were produced. Gibbs sampling was used to estimate parameter values in the models, including the locations of topic boundaries. The paper shows how discourse features were integrated into the Bayesian model, and reports empirical evaluations of the benefit obtained through the inclusion of each feature and of the suitability of alternative models of the placement of topic boundaries. It demonstrates how multiple cues to segmentation can be combined in a principled way, and empirical tests show a clear improvement over previous work.
... We have developed the wiki-like platform from scratch simply because existing annotation systems, such as GATE (Dowman et al., 2005), NITE (Carletta et al., 2003), or UIMA (Hahn et al., 2007), do not offer the functionality required for deep semantic annotation combined with crowdsourcing. ...
Poster
Full-text available
Named Entity Extraction is a mature task in the NLP field that has yielded numerous services gaining popularity in the Semantic Web community for extracting knowledge from web documents. These services are generally organized as pipelines, using dedicated APIs and different taxonomies for extracting, classifying and disambiguating named entities. Integrating one of these services in a particular application requires implementing an appropriate driver. Furthermore, the results of these services are not comparable due to different formats. This prevents the comparison of the performance of these services as well as their possible combination. We address this problem by proposing NERD, a framework which unifies 10 popular named entity extractors available on the web, and the NERD ontology, which provides a rich set of axioms aligning the taxonomies of these tools.
... Kostkova et al. [14] demonstrated the use of the Semantic Web to provide contextualized browsing in health portals' web pages. In an earlier study, semantic annotation was used to link news broadcasts with related online resources [15]. However, semantic annotation from medical reports' text to improve consumer understanding and access to informational resources is still largely unexplored. ...
Article
Full-text available
Patients often have difficulty in understanding medical concepts and vocabulary in their Discharge Summaries. We explore automatic hyper-linking to online resources for difficult terms as a means of making the content more comprehensible for patients. We use the Consumer Health Vocabulary (CHV) as a resource for scoring the difficulty of terms and to provide the most consumer-friendly synonyms. We implement a term extraction component providing semantic annotation using the KIM Knowledge and Information Management Platform. We hyperlink these terms to pages indexed by MedLinePlus to provide consumer-friendly online explanations. A web interface allows for viewing annotated Discharge Summaries and browsing search results. In a preliminary evaluation, the system was used to annotate eight Clinical Management sections of Discharge Summaries. The automatic hyper-linking provides good precision in linking to topically-relevant pages indexed by MedLinePlus. Our approach shows promise as a technology to deploy in future portals where consumers view their Discharge Summaries online.
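A hedged sketch of the term-scoring and hyper-linking step: the CHV familiarity scores, the synonym table, and the link URL below are placeholders invented for illustration, not the real vocabulary or the MedlinePlus index used by the authors.

```python
# Hypothetical data: low scores mean a term is unfamiliar to consumers.
CHV_SCORES = {"myocardial infarction": 0.2, "heart attack": 0.9, "aspirin": 0.8}
CONSUMER_SYNONYMS = {"myocardial infarction": "heart attack"}

def link_difficult_terms(terms, threshold=0.5):
    """Attach a consumer-friendly link to each term whose familiarity score
    falls below the threshold, using its consumer synonym when available."""
    links = {}
    for term in terms:
        if CHV_SCORES.get(term, 0.0) < threshold:
            friendly = CONSUMER_SYNONYMS.get(term, term)
            # Placeholder URL; a real system would point at an indexed consumer resource.
            links[term] = "https://example.org/consumer-health?query=" + friendly.replace(" ", "+")
    return links

print(link_difficult_terms(["myocardial infarction", "aspirin"]))
```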
... Nakov and Hearst [30] have shown the power of using the Web as training data for natural language analysis. Web-assistance for extracting keywords for the purposes of content indexing and annotation is studied in [12, 37, 26]. This work is focused on automated, Web-based tools for understanding the meaning of the text as written, as opposed to the inferences that can be drawn based on the text. ...
Article
Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences. Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control. The paper reports on two experiments which demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.
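A simplified sketch of the Web-based inference control loop described above, assuming a caller-supplied `search` function in place of a real search engine API:

```python
from itertools import combinations

def inference_candidates(extracted_keywords, search, max_subset=2, top_docs=10):
    """Query subsets of the document's salient keywords against a reference
    corpus and count co-occurring terms that were NOT in the original document.
    `search` is assumed to map a query string to a list of result texts."""
    original = set(extracted_keywords)
    flagged = {}
    for size in range(1, max_subset + 1):
        for subset in combinations(sorted(original), size):
            for doc in search(" ".join(subset))[:top_docs]:
                for term in set(doc.lower().split()) - original:
                    flagged[term] = flagged.get(term, 0) + 1
    # Terms that repeatedly co-occur with the released keywords hint at
    # inferences a reader could draw; high-count terms go to manual review.
    return sorted(flagged.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with a stubbed reference corpus instead of a live search engine.
corpus = ["acme layoffs rumored after merger", "acme merger with globex announced"]
fake_search = lambda q: [d for d in corpus if all(w in d for w in q.split())]
print(inference_candidates(["acme", "merger"], fake_search)[:3])
```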
... On the other hand, semantic annotation of videos also serves a similar purpose for indexing and retrieval, widely known in Content Based Image Retrieval (CBIR) [14]. The existing video annotation approaches are based either on the analysis of transliterations or transcripts of a video recording [21][22][23], or on the motion detected and extracted from a video recording [24,25]. The latter approach is usually ontology-based, where the ontology serves as the knowledge foundation for annotation. ...
Article
With the advent of various services and applications of the Semantic Web, semantic annotation has emerged as an important research area. The use of semantically annotated ontologies has been evident in numerous information processing and retrieval tasks. One such task is utilizing a semantically annotated ontology in product design, which can support many important applications that are critical to aiding various design-related tasks. However, ontology development in design engineering remains a time-consuming and tedious task that demands tremendous human effort. In the context of product family design, management of different product information that features efficient indexing, update, navigation, search and retrieval across product families is both desirable and challenging. This paper attempts to address this issue by proposing an information management and retrieval framework based on a semantically annotated product family ontology. In particular, we propose a document profile (DP) model to suggest semantic tags for annotation purposes. Using a case study of digital camera families, we illustrate how faceted search and retrieval of product information can be accomplished based on the semantically annotated camera family ontology. Lastly, we briefly discuss some further research and applications in design decision support, e.g. commonality and variety, based on the semantically annotated product family ontology.
... In education, by contrast, semantic annotations of video recordings of lectures distributed over the Internet can be used to augment the material by providing explanations, references or examples, so that students can efficiently access, find and review material in a personal manner [4], [5]. Moreover, in television, semantic annotation of programmes, for example news, could produce electronic programme guides, which would allow the user to view details of forthcoming programmes in terms of entities referred to in particular broadcasts [6]. In this paper, we propose how to introduce a semantic-based representation of the information embedded in video media, both through user interaction and by exploiting ontologies. ...
Article
Full-text available
In this paper we propose ontology-based video content annotation and recommendation tools. Our system is able to perform automatic shot detection and supports users during the annotation phase in a collaborative framework by providing suggestions on the basis of actual user needs as well as modifiable user behaviour and interests. Annotations are based on domain ontologies expressing hierarchical links between entities and guaranteeing interoperability of resources. Examples to verify the effectiveness of both the shot detection and the frame matching modules are analyzed.
... For example, Narayan et al. [17] developed a multi-modal mobile interface combining speech and text for accessing web information through a personalized dialog. Dowman et al. [5] dealt with the problem of indexing and searching radio and television news using both speech and text. In the current work, we explore the potential of the particular modality pair of image and text for searching computer-related articles. ...
Article
Many online articles contain useful know-how knowledge about GUI applications. Even though these articles tend to be richly illustrated by screenshots, no system has been designed to take advantage of these screenshots to visually search know-how articles effectively. In this paper, we present a novel system to index and search software know-how articles that leverages the visual correspondences between screenshots. To retrieve articles about an application, users can take a screenshot of the application to query the system and retrieve a list of articles containing a matching screenshot. Useful snippets such as captions, references, and nearby text are automatically extracted from the retrieved articles and shown alongside the thumbnails of the matching screenshots as excerpts for relevancy judgement. Retrieved articles are ranked by a comprehensive set of visual, textual, and site features, whose weights are learned by RankSVM. Our prototype system currently contains 150k articles that are classified into walkthrough, book, gallery, and general categories. We demonstrated the system's ability to retrieve matching screenshots for a wide variety of programs, across language boundaries, and provide subjectively more useful results than keyword-based web and image search engines.
... Several studies have been carried out on employing NER output in particular for semantic multimedia annotation:
• a multimedia indexing system for English, German and Dutch football videos (Saggion et al., 2004)
• a video annotation system for Italian news videos (Basili et al., 2005)
• an automatic annotation system for BBC radio and TV news (Dowman et al., 2005) ...
Conference Paper
Full-text available
Named entity recognition (NER) is one of the main information extraction tasks, and research on NER from Turkish texts is known to be rare. In this study, we present a rule-based NER system for Turkish which employs a set of lexical resources and pattern bases for the extraction of named entities including the names of people, locations, and organizations, together with time/date and money/percentage expressions. The domain of the system is news texts, and it does not utilize the important clues of capitalization and punctuation, since they may be missing in texts obtained from the Web or in the output of automatic speech recognition tools. The evaluation of the system is performed on news texts along with other genres encompassing child stories and historical texts, but, as expected in the case of manually engineered rule-based systems, it suffers from performance degradation on these latter genres of texts since they are distinct from the target domain of news texts. Furthermore, the system is evaluated on transcriptions of news videos, leading to satisfactory results, which is an important step towards the employment of NER during automatic semantic annotation of videos in Turkish. The current study is significant as the first rule-based approach to the NER task on Turkish texts, with its evaluation on diverse text types.
... However, the real world is not machine readable and therefore knowledge needs to be extracted from documents, audio and video recordings or data streams. The extraction of information from content [8,9] and the detection of senses [36] are major research fields, and the interested reader is referred to the related literature for details. However, it should be noted that these annotations are often done manually, e.g. by tagging the song "Satisfaction" with the type "Rock", as extraction technologies for audiovisual data still have many limitations. ...
Article
Full-text available
The medium is the message! And the message was literacy, media democracy and music charts. Mostly one single distinguishable medium such as TV, the Web, the radio, or books transmitted the message. Now in the age of ubiquitous and pervasive computing, where information flows through a plethora of distributed interlinked media—what is the message ambient media will tell us? What does semantic mean in this context? Which experiences will it open to us? What is content in the age of ambient media? Ambient media are embedded throughout the natural environment of the consumer—in his home, in his car, in restaurants, and on his mobile device. Predominant sample services are smart wallpapers in homes, location based services, RFID based entertainment services for children, or intelligent homes. The goal of this article is to define semantic ambient media and discuss the contributions to the Semantic Ambient Media Experience (SAME) workshop, which was held in conjunction with the ACM Multimedia conference in Vancouver in 2008. The results of the workshop can be found on: www.ambientmediaassociation.org.
... On the other hand, semantic annotation of videos also serves a similar purpose for indexing and retrieval, widely known in Content Based Image Retrieval (CBIR) (Chang & Liu, 1984). The existing video annotation approaches are based either on the analysis of transliterations or transcripts of a video recording (Dowman, Tablan, Cunningham, & Popov, 2005; Repp, Linckels, & Meinel, 2007, 2008), or on the motion detected and extracted from a video recording (Bertini, Bimbo, Cucchiara, & Prati, 2004; Bertini, Bimbo, & Torniai, 2006). The latter approach is usually ontology-based, where the ontology serves as the knowledge foundation for annotation. ...
... [23]), interpreting search engine queries (e.g. [11]), automatic indexing and annotation [7, 8] and problems in structural linguistics like discovering conventional expressions (e.g. [17]). ...
Conference Paper
Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through web-mining algorithms that leverage search engines such as Google or Yahoo!. Our model also includes the important case of private corpora, to model inference detection in enterprise settings in which there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine. We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g. "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic. These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well-suited to efficient inference detection.
Chapter
In recent years, many works have been published in the video indexing and retrieval field. However, few methods have been designed for Arabic video. The aim of this paper is to present a new approach for Arabic news video indexing based on embedded text as the information source and on knowledge extraction techniques, in order to provide a conceptual description of video content. First, we apply low-level processing to detect and recognize the video texts. Then, we extract conceptual information, including names of persons, organizations, and locations, using local grammars implemented with the linguistic platform NooJ. Our proposed approach was tested on a large collection of Arabic TV news, and the experimental results were satisfactory.
Book
This book introduces core natural language processing (NLP) technologies to non-experts in an easily accessible way, as a series of building blocks that lead the user to understand key technologies, why they are required, and how to integrate them into Semantic Web applications. Natural language processing and Semantic Web technologies have different, but complementary roles in data management. Combining these two technologies enables structured and unstructured data to merge seamlessly. Semantic Web technologies aim to convert unstructured data to meaningful representations, which benefit enormously from the use of NLP technologies, thereby enabling applications such as connecting text to Linked Open Data, connecting texts to each other, semantic searching, information visualization, and modeling of user behavior in online networks. The first half of this book describes the basic NLP processing tools: tokenization, part-of-speech tagging, and morphological analysis, in addition to the main tools...
Article
We performed an exploratory case study to understand how subject indexing performed by television production staff using a semi-controlled vocabulary affects indexing quality. In the study we used triangulation, combining tag analysis and semi-structured interviews with production staff of the Norwegian Broadcasting Corporation. The main findings reveal incomplete indexing of TV programs and their parts, in addition to low indexing consistency and uneven indexing exhaustivity. The informants expressed low motivation and a high level of uncertainty regarding the task. Internal guidelines and high domain knowledge among the indexers do not form a sufficient basis for creating quality and consistency in the vocabulary. The challenges revealed in the terminological analysis, combined with low indexing knowledge and a lack of motivation, will create difficulties in the retrieval phase.
Conference Paper
The BBC has a very large archive of programmes, covering a wide range of topics. This archive holds a significant part of the BBC's institutional memory and is an important part of the cultural history of the United Kingdom and the rest of the world. These programmes, or parts of them, can help provide valuable context and background for current news events. However, the BBC's archive catalogue is not a complete record of everything that was ever broadcast. For example, it excludes the BBC World Service, which has been broadcasting since 1932. This makes the discovery of content within these parts of the archive very difficult. In this paper we describe a system based on Semantic Web technologies which helps us to quickly locate content related to current news events within those parts of the BBC's archive with little or no pre-existing metadata. This system is driven by automated interlinking of archive content with the Semantic Web, user validations of the resulting data, and topic extraction from live BBC News subtitles. The resulting interlinks between live news subtitles and the BBC's archive are used in a dynamic visualisation enabling users to quickly locate relevant content. This content can then be used by journalists and editors to provide historical context, background information and supporting content around current affairs.
Article
The British Broadcasting Corp. (BBC) manually tags recent programs on its website. Editors draw and assign these tags from open datasets made available within the Linked Data cloud, but this is a time-consuming process. Aside from recent programming, which is tagged, the BBC has a large radio archive that is untagged. Thus the possibility of automatically assigning tags to programs in a reasonable amount of time has been investigated. Tags enable a variety of use cases, such as dynamic building of topical aggregations, retrieval through topic-based search, or cross-domain navigation. Automatic tagging of archive content would ensure archive programs are as findable as recent programs. It would mean that topic-based collections of archive content can be easily built, for example, to find archive content that relates to current news events. This paper describes an infrastructure to process large program archives in a cost-effective and scalable manner using Amazon Web Services. An automated tagging algorithm using speech audio as an input is described. The paper also explains how this algorithm can be separated and distributed and how the workflow can be managed robustly, ensuring appropriate error handling, resource monitoring, and data management on a large scale. Finally, the results from processing the BBC World Service English-speaking audio archive are presented.
Article
Purpose – The purpose of this paper is to investigate the use of television (TV) content for scholarly purposes. It focuses on: profile of scholars using TV content; the structure of their need for TV content; the situations in which scholars need TV content; and their patterns of use of TV content in each research stage. Design/methodology/approach – Taylor’s four components of the information use environment has contributed to the development of a conceptual framework. The data from the use of TV content by 668 scholars were profiled using correspondence analysis and co-word analysis. Additionally, the data from 15 interviews and content from 240 journal articles were analysed. Findings – The authors determined that the environment of the scholarly use of TV content is unique in terms of the scholars’ academic domains, research topics, motivation, and patterns of use. Six academic domains were identified as having used TV content to a meaningful degree, and their knowledge structure was presented as a map depicting the scholars’ needs for TV content. Scholars are likely to use TV content when they deal with timely social and cultural topics, or human behaviour. The scholars also showed different patterns of use of TV content at each stage of research. Originality/value – In this study, TV content was newly examined from the perspective of an information source for scholarly purposes, and it was found to be a meaningful source in several domains. This result extends the knowledge of information sources in scholarly communication and information services.
Conference Paper
Document redaction is widely used to protect sensitive information in published documents. In a basic redaction system, sensitive and identifying terms are removed from the document. Web-based inference is an attack on redaction systems whereby the redacted document is linked with other publicly available documents to infer the removed parts. Web-based inference also provides an approach for detecting unwanted inferences and so constructing secure redaction systems. Previous works on web-based inference used general keyword extraction methods for document representation. We propose a systematic approach, based on information-theoretic concepts and measures, to rank the words in a document for the purpose of inference detection. We extend our results to the case of multiple sensitive words and propose a metric that takes into account the possible relationships among the sensitive words, resulting in an effective and efficient inference detection system. Through a number of experiments we show that our approach, when used for document redaction, substantially reduces the number of inferences that are left in a document. We describe our approach, present the experimental results, and outline future work.
Conference Paper
In this paper, we give an overview of recent BBC R&D work on automated affective and semantic annotations of BBC archive content, covering different types of use-cases and target audiences. In particular, after giving a brief overview of manual cataloguing practices at the BBC, we focus on mood classification, sound effect classification and automated semantic tagging. The resulting data is then used to provide new ways of finding or discovering BBC content. We describe two such interfaces, one driven by mood data and one driven by semantic tags.
Article
It is commonly acknowledged that ever-increasing video archives should be conveniently indexed with the conveyed semantic information to facilitate later video retrieval. Domain-independent semantic video indexing is usually carried out through manual means which is too time-consuming and labor-intensive to be employed in practical settings. On the other hand, fully automated approaches are usually proposed for very specialized domains such as team sports videos. In this paper, we propose a generic text-based semi-automatic system for off-line semantic indexing and retrieval of news videos, since video texts such as speech transcripts stand as a plausible source of semantic information. The proposed system has a pipelined flow of execution where the sole manual intervention takes place during text extraction, yet it could execute in fully automated mode in case the associated video text is already available or a convenient text extractor is available to be incorporated into the system. At the core of the system is an information extraction component – a named entity recognizer – which extracts representative semantic information from the video texts. Based on the proposed generic system, a novel semantic annotation and retrieval system for Turkish is designed, implemented, and evaluated on two distinct news video data sets. By equipping it with the necessary components, the ultimate system is also turned into a multilingual video retrieval system and executed on a video data set in English, thereby facilitating multilingual semantic video retrieval.
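As a rough illustration of text-based semantic indexing via named-entity recognition (not the authors' Turkish system), the sketch below builds a simple entity-to-segment index from transcript segments, using spaCy's English model as a stand-in recognizer.

```python
# Minimal sketch of text-based semantic indexing of news video transcripts
# via named-entity recognition. This is not the authors' Turkish system;
# spaCy's English model is used here as a stand-in recognizer
# (requires the en_core_web_sm model to be installed).
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def build_entity_index(segments):
    """segments: list of (segment_id, transcript_text).
    Returns a mapping entity text -> set of segment ids, usable as a
    simple semantic index for retrieval."""
    index = defaultdict(set)
    for seg_id, text in segments:
        for ent in nlp(text).ents:
            if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
                index[ent.text.lower()].add(seg_id)
    return index

# index = build_entity_index([(0, "The prime minister visited Ankara today.")])
# index["ankara"] -> {0}
```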
Chapter
In this paper we describe a method to automatically discover important concepts and their relationships in e-Lecture material. The discovered knowledge is used to display semantic-aware categorizations and query suggestions for facilitating navigation inside an unstructured multimedia repository of e-Lectures. We report on an implemented approach for dealing with learning materials referring to the same event in different languages. The information acquired from the speech is combined with documents such as presentation slides, which are temporally synchronized with the video, to create new knowledge through a mapping to a taxonomy representation such as Wikipedia.
Article
The focus of much of the research on providing user-centered control of multimedia has been on the definition of models and (meta-data) descriptions that assist in locating or recommending media objects. While this can provide a more efficient means of selecting content, it provides little extra control for users once that content is rendered. In this article, we consider various means for supporting user-centered control of media within a collection of objects that are structured into a multimedia presentation. We begin with an examination of the constraints of user-centered control based on the characteristics of multimedia applications and the media processing pipeline. We then define four classes of control that can enable a more user-centric manipulation within media content. Each of these control classes is illustrated in terms of a common news viewing system. We continue with reflections on the impact of these control classes on the development of multimedia languages, rendering infrastructures and authoring systems. We conclude with a discussion of our plans for infrastructure support for user-centered multimedia control.
Article
Full-text available
This paper deals with multimedia information access. We propose two new approaches for hybrid text-image information processing that can be straightforwardly generalized to the more general multimodal scenario. Both approaches fall into the trans-media pseudo-relevance feedback category. Our first method uses a mixture model of the aggregate components, considering them as a single relevance concept. In our second approach, we define trans-media similarities as an aggregation of monomodal similarities between the elements of the aggregate and the new multimodal object. We also introduce the monomodal similarity measures for text and images that serve as basic components for both proposed trans-media similarities. We show how a large variety of problems can be framed so as to be addressed with the proposed techniques: image annotation or captioning, text illustration, and multimedia retrieval and clustering. Finally, we present how these methods can be integrated into two applications: a travel blog assistant system and a tool for browsing Wikipedia that takes into account the multimedia nature of its content.
Conference Paper
Knowledge develops and spreads rapidly in quality, quantity, depth, and extent. The key source of sustainable competitive advantage lies in the way knowledge is created, shared, and utilized. This paper discusses the necessity of using knowledge elements for managing and representing knowledge, presents the framework of the main system, and describes the process of the knowledge-element mining subsystem and the main relationships among knowledge elements.
Conference Paper
Video texts - if available - constitute a valuable source for automatic semantic annotation of large video archives. In this paper, we present our attempts towards the improvement of a text-based semantic annotation and retrieval system for Turkish news videos through automatic Web alignment and event extraction. The results of our initial experiments turn out to be promising and these two features are incorporated into the existing system. Although the ideas of automatic Web alignment and text-based event extraction are not the novel contributions of the current paper, to the best of our knowledge, their first implementation and employment in a system for Turkish news videos is a significant contribution to related work on videos in lesser studied languages such as Turkish. Also overviewed in the current paper is the prospective version of the system encompassing components for several other tasks including topic segmentation, keyphrase extraction, news categorization and summarization to enhance the overall system.
Article
Full-text available
This paper presents enhanced work building on our previous paper (Chaisorn et al. 2002). The system is enhanced to perform news story segmentation on the large video corpus used in the TRECVID 2003 evaluation. We use a combination of features, including visual features such as color, object-based features such as faces and video text, temporal features such as audio and motion, and semantic features such as cue phrases. We employ a Decision Tree and specific detectors to perform shot classification/tagging. We then use the shot category information, along with two temporal features, to identify story boundaries using HMMs (Hidden Markov Models). Finally, a heuristic rule-based technique is applied to classify each detected story as "news" or "misc".
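The HMM-based boundary detection step can be illustrated with a generic Viterbi decoder over per-shot category labels. The states, categories, and all probabilities below are invented for illustration and are not the parameters of the system described above.

```python
# Minimal Viterbi sketch for HMM-based story boundary detection over a
# sequence of shot categories. States, categories, and probabilities are
# invented for illustration only.
import math

states = ["story_body", "boundary"]
start_p = {"story_body": 0.9, "boundary": 0.1}
trans_p = {"story_body": {"story_body": 0.9, "boundary": 0.1},
           "boundary":   {"story_body": 0.8, "boundary": 0.2}}
# Emission probabilities of the observed shot category given the state.
emit_p = {"story_body": {"anchor": 0.2, "report": 0.6, "graphics": 0.2},
          "boundary":   {"anchor": 0.7, "report": 0.1, "graphics": 0.2}}

def viterbi(observations):
    """Return the most likely state sequence for the observed shot tags."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
          for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[-1][p][0] + math.log(trans_p[p][s]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s][obs]))
            layer[s] = (score, V[-1][best_prev][1] + [s])
        V.append(layer)
    return max(V[-1].values())[1]

# viterbi(["anchor", "report", "report", "anchor", "report"]) returns one state
# per shot; shots decoded as "boundary" suggest candidate story starts.
```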
Article
Full-text available
In this paper, we use Barry and Hartigan's Product Partition Models to formulate text segmentation as an optimization problem, which we solve by a fast dynamic programming algorithm. We test the algorithm on Choi's segmentation benchmark and achieve the best segmentation results so far reported in the literature.
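The optimization view described above lends itself to a standard dynamic program over candidate boundaries. The sketch below is a minimal illustration with a placeholder per-segment cost; it does not reproduce the Product Partition Model score used in the paper.

```python
# Minimal dynamic-programming sketch for text segmentation as an optimization
# problem: choose boundaries minimizing the total cost of the segments.
# segment_cost() is a placeholder, not the paper's actual model score.

def segment_cost(sentences, i, j):
    """Cost of making sentences[i:j] one segment (placeholder:
    penalize vocabulary spread within the segment)."""
    vocab, total = set(), 0
    for s in sentences[i:j]:
        words = s.lower().split()
        vocab.update(words)
        total += len(words)
    return len(vocab) / max(total, 1)

def segment(sentences, max_len=8):
    """Return sorted boundary positions (indices where a new segment starts)."""
    n = len(sentences)
    best = [0.0] + [float("inf")] * n      # best[j] = min cost of sentences[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            c = best[i] + segment_cost(sentences, i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover boundary positions by walking back pointers.
    bounds, j = [], n
    while j > 0:
        bounds.append(back[j])
        j = back[j]
    return sorted(b for b in bounds if b > 0)
```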
Conference Paper
Full-text available
The KIM platform provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature infrastructure for scalable and customizable information extraction (IE), as well as annotation and document management, based on GATE. In order to provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. The ontologies and knowledge bases involved are handled using cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware, and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. This paper presents the KIM platform, with emphasis on its architecture, interfaces, tools, and other technical issues.
Conference Paper
Full-text available
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
Article
Full-text available
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
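One way to realize the web-mined statistical association described in these two entries is pointwise mutual information estimated from document counts. The sketch below is a hedged illustration rather than the Kea enhancement itself; doc_count() is a hypothetical search-engine or index API, and N is an assumed collection size.

```python
# Hedged sketch of measuring statistical association between candidate
# keyphrases with web counts, as one way to prefer coherent keyphrase sets.
# doc_count() is a hypothetical web search / index API; N is an assumed
# collection size.
import math

def doc_count(query: str) -> int:
    raise NotImplementedError("plug in a search engine or index here")

def pmi(phrase_a: str, phrase_b: str, N: float = 1e10) -> float:
    """Pointwise mutual information between two phrases estimated from
    document co-occurrence counts."""
    na, nb = doc_count(f'"{phrase_a}"'), doc_count(f'"{phrase_b}"')
    nab = doc_count(f'"{phrase_a}" "{phrase_b}"')
    if min(na, nb, nab) == 0:
        return float("-inf")
    return math.log((nab / N) / ((na / N) * (nb / N)))

def coherence(candidates):
    """Average pairwise association of a candidate keyphrase set."""
    pairs = [(a, b) for i, a in enumerate(candidates) for b in candidates[i + 1:]]
    return sum(pmi(a, b) for a, b in pairs) / max(len(pairs), 1)
```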
Article
Full-text available
This paper describes a method for linear text segmentation that is more accurate or at least as accurate as state-of-the-art methods (Utiyama and Isahara, 2001; Choi, 2000a). Inter-sentence similarity is estimated by latent semantic analysis (LSA). Boundary locations are discovered by divisive clustering. Test results show LSA is a more accurate similarity measure than the cosine metric (van Rijsbergen, 1979).
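A minimal sketch of the LSA similarity idea, using scikit-learn's TruncatedSVD over TF-IDF vectors; a simple low-similarity threshold stands in for the divisive clustering step used in the paper.

```python
# Minimal sketch of LSA-based inter-sentence similarity for text segmentation,
# using scikit-learn. A low-adjacent-similarity heuristic stands in for the
# paper's divisive clustering step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def adjacent_lsa_similarities(sentences, n_components=50):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    n_components = min(n_components, tfidf.shape[1] - 1, len(sentences) - 1)
    lsa = TruncatedSVD(n_components=max(n_components, 1)).fit_transform(tfidf)
    sims = cosine_similarity(lsa[:-1], lsa[1:])
    # sims[i, i] is the similarity between sentence i and sentence i+1.
    return [sims[i, i] for i in range(len(sentences) - 1)]

def boundaries(sentences, threshold=0.1):
    """Indices i such that a segment boundary is placed between
    sentences[i-1] and sentences[i]."""
    return [i + 1 for i, s in enumerate(adjacent_lsa_similarities(sentences))
            if s < threshold]
```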
Article
Full-text available
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited. 1 Introduction Keyphrases give a high-level description of a document's contents that is intended to make it easy for prospective readers to decide whether or not...
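Kea-style extraction is commonly described as feeding two simple per-candidate features, TF x IDF and the relative position of first occurrence, to a naive Bayes classifier. The sketch below computes those two features; the classifier and candidate-phrase generation are omitted, and the exact feature definitions here are an approximation.

```python
# Hedged sketch of two classic Kea-style features for a candidate phrase:
# TF x IDF and the relative position of first occurrence. A naive Bayes
# model trained on phrases labelled keyphrase / non-keyphrase would consume
# these features; the model itself is omitted.
import math

def kea_features(phrase, document, doc_freq, n_docs):
    """doc_freq: number of training documents containing the phrase;
    n_docs: total number of training documents."""
    doc, p = document.lower(), phrase.lower()
    tf = doc.count(p) / max(len(doc.split()), 1)
    idf = math.log((n_docs + 1) / (doc_freq + 1))
    pos = doc.find(p)
    first_occurrence = pos / max(len(doc), 1) if pos >= 0 else 1.0  # 0 = early
    return {"tf_idf": tf * idf, "first_occurrence": first_occurrence}
```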
Article
Full-text available
We present a new method for discovering the segmental discourse structure of a document while categorizing each segment's function and importance. Segments are determined by a zero-sum weighting scheme applied to occurrences of noun phrases and pronominal forms retrieved from the document. Segment roles are then calculated from the distribution of the terms in the segment. Finally, we present the results of an evaluation in terms of precision and recall, which surpass those of earlier approaches.
Article
Full-text available
This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large-vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a real-time British English Broadcast News system and its integration into a spoken document retrieval system. Detailed evaluation is performed using a similar North American Broadcast News system, to take advantage of the TREC SDR evaluation methodology. We report results on this evaluation, with particular reference to the effect of query expansion and of automatic segmentation algorithms. 1. INTRODUCTION THISL is an ESPRIT Long Term Research project in the area of speech retrieval. It is concerned with the construction of a system which performs good recognition of broadcast speech from television and radio news programmes, from which it can produce multimedia indexing data. The principal objective of the project is to construct a spo...
Article
The approach towards Semantic Web information extraction (IE) presented here is implemented in KIM, a platform for semantic indexing, annotation, and retrieval. It combines IE based on the mature text engineering platform GATE with Semantic Web-compliant knowledge representation and management. The cornerstone is the automatic generation of named-entity (NE) annotations with class and instance references to a semantic repository. A simplistic upper-level ontology, providing detailed coverage of the most popular entity types (Person, Organization, Location, etc.; more than 250 classes), is designed and used. A knowledge base (KB) with de facto exhaustive coverage of real-world entities of general importance is maintained, used, and constantly enriched. Extensions of the ontology and KB take care of handling all the lexical resources used for IE; most notably, instead of gazetteer lists, aliases of specific entities are kept together with them in the KB. A Semantic Gazetteer uses the KB to generate lookup annotations. Ontology-aware pattern-matching grammars allow precise class information to be handled via rules at the optimal level of generality. The grammars are used to recognize NEs, with class and instance information referring to the KIM ontology and KB. Recognition of identity relations between the entities is used to unify their references to the KB. Based on the recognized NEs, template relation construction is performed via grammar rules. As a result, the KB is enriched with the recognized relations between entities. In the final phase of the IE process, previously unknown aliases and entities are added to the KB with their specific types.
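For illustration only, the sketch below shows how retrieval by semantic restriction over entity annotations can look when the annotations are stored as RDF and queried with SPARQL via rdflib. The ex: namespace, classes, and properties are invented and are not KIM's actual ontology or API.

```python
# Illustrative sketch (not KIM's actual ontology or API) of retrieving
# documents by a semantic restriction over entity annotations stored as RDF.
# The ex: namespace and all property names are invented.
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/kb#> .
ex:doc1 ex:mentions ex:acme .
ex:acme a ex:Company ; ex:locatedIn ex:london .
ex:doc2 ex:mentions ex:bob .
ex:bob a ex:Person .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# "Find documents mentioning a Company located in London."
query = """
PREFIX ex: <http://example.org/kb#>
SELECT ?doc WHERE {
  ?doc ex:mentions ?entity .
  ?entity a ex:Company ; ex:locatedIn ex:london .
}
"""
for row in g.query(query):
    print(row.doc)   # -> http://example.org/kb#doc1
```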
Article
The KIM platform provides a novel Knowledge and Information Management framework and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature and semantically enabled infrastructure for scalable and customizable information extraction (IE), as well as annotation and document management, based on GATE. Our understanding is that a system for semantic annotation should be based upon a simple model of real-world entity concepts, complemented with quasi-exhaustive instance knowledge. To ensure efficiency, easy sharing, and reusability of the metadata, we introduce an upper-level ontology. Based on the ontology, a large-scale instance base of entity descriptions is maintained. The knowledge resources involved are handled by use of state-of-the-art Semantic Web technology and standards, including RDF(S) repositories, ontology middleware and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, for content retrieval based on semantic queries, and for semantic repository access. As a framework, KIM also allows various IE modules, semantic repositories and information retrieval engines to be plugged into it. This paper presents the KIM platform, with an emphasis on its architecture, interfaces, front-ends, and other technical issues.
Chapter
This chapter details the value and methods for content augmentation and personalization among different media such as TV and Web. We illustrate how metadata extraction can aid in combining different media to produce a novel content consumption and interaction experience. We present two pilot content augmentation applications. The first, called MyInfo, combines automatically segmented and summarized TV news with information extracted from Web sources. Our news summarization and metadata extraction process employs text summarization, anchor detection and visual key element selection. Enhanced metadata allows matching against the user profile for personalization. Our second pilot application, called InfoSip, performs person identification and scene annotation based on actor presence. Person identification relies on visual, audio, text analysis and talking face detection. The InfoSip application links person identity information with filmographies and biographies extracted from the Web, improving the TV viewing experience by allowing users to easily query their TVs for information about actors in the current scene.
Article
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
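The probabilistically motivated error metric introduced in this line of work is usually referred to as Pk in later literature: the probability that two positions a fixed distance apart are placed in the same segment by one segmentation but not by the other. A minimal sketch, assuming segmentations are given as lists of segment lengths:

```python
# Minimal sketch of the windowed segmentation error metric commonly
# associated with this work (often called Pk). Segmentations are given
# as lists of segment lengths; lower is better.
def pk(reference, hypothesis, k=None):
    def labels(seg_lengths):
        # Label each unit with the index of the segment it belongs to.
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        k = max(1, round(len(ref) / (2 * len(reference))))  # half mean segment size
    errors, total = 0, len(ref) - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / max(total, 1)

# pk([5, 5, 5], [5, 10]) -> a value > 0, penalizing the missed boundary.
```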
Article
The realization of the Semantic Web depends on the availability of a critical mass of metadata for web content, associated with the respective formal knowledge about the world. We claim that the Semantic Web, at its current stage of development, is in critical need of metadata generation and usage schemata that are specific, well-defined, and easy to understand. This paper introduces our vision for a holistic architecture for semantic annotation, indexing, and retrieval of documents with regard to extensive semantic repositories. A system (called KIM) implementing this concept is presented in brief and is used for the purposes of evaluation and demonstration. A particular schema for semantic annotation with respect to real-world entities is proposed. The underlying philosophy is that practical semantic annotation is impossible without some particular knowledge modelling commitments. Our understanding is that a system for such semantic annotation should be based upon a simple model of real-world entity classes, complemented with extensive instance knowledge. To ensure the efficiency, ease of sharing, and reusability of the metadata
Article
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on an archive of British English broadcast news.
Conference Paper
This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
Conference Paper
Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The scale of such archives, along with the associated content markup cost, makes it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world's largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), to make a quantum leap in the ability to access such archives by advancing the state of the art in Automated Speech Recognition (ASR), Natural Language Processing (NLP) and related technologies (1, 2). This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled with disfluencies, heavy accents, age-related coarticulations, and un-cued speaker and language switching. This paper discusses some of the ASR and NLP tools and technologies that we have been building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics of this corpus.
Article
Title generation is a complex task involving both natural language understanding and natural language synthesis. In this paper, we propose a new probabilistic model for title generation. Unlike previous statistical models for title generation, which treat title generation as a process that converts the 'document representation' of information directly into a 'title representation' of the same information, this model introduces a hidden state called the 'information source' and divides title generation into two steps: distilling the 'information source' from the observation of a document, and generating a title from the estimated 'information source'. In our experiments, the new probabilistic model outperforms the previous models for title generation in terms of both automatic evaluations and human judgments.
Article
This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state of the art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.
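The "rank in the local context" step can be illustrated by replacing each cell of an inter-sentence similarity matrix with the proportion of its neighbours that have a lower value. The sketch below is a simplified rendering of that idea; the divisive clustering step is omitted and the mask size is illustrative.

```python
# Hedged sketch of the "rank in the local context" idea: each cell of an
# inter-sentence similarity matrix is replaced by the proportion of its
# neighbours (within a small mask) that have a lower similarity. The
# subsequent divisive clustering step is omitted.
import numpy as np

def rank_transform(sim, mask=11):
    n = sim.shape[0]
    half = mask // 2
    ranked = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            r0, r1 = max(0, i - half), min(n, i + half + 1)
            c0, c1 = max(0, j - half), min(n, j + half + 1)
            window = sim[r0:r1, c0:c1]
            ranked[i, j] = (window < sim[i, j]).sum() / max(window.size - 1, 1)
    return ranked
```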
Article
Continuing progress in the automatic transcription of broadcast speech via speech recognition has raised the possibility of applying information retrieval techniques to the resulting (errorful) text. In this paper we describe a general methodology based on Hidden Markov Models and classical language modeling techniques for automatically inferring story boundaries (segmentation) and for retrieving stories relating to a specific topic (tracking). We will present in detail the features and performance of the Segmentation and Tracking systems submitted by Dragon Systems for the 1998 Topic Detection and Tracking evaluation. 1. INTRODUCTION Over the last few years Dragon, like a number of other research sites, has been developing a speech recognition system capable of automatically transcribing broadcast speech. With the recent advances in this technology, a new source is becoming available for information mining, in the form of a continuous stream of errorful, unsegmented text. Applying s...
Article
This paper documents the Information Extraction Named-Entity Evaluation (IE-NE), one of the new spokes added to the DARPA-sponsored 1998 Hub-4 Broadcast News Evaluation. This paper discusses the information extraction task as posed for the 1998 Broadcast News Evaluation. This paper reviews the evaluation metrics, the scoring process, and the test corpus that was used for the evaluation. Finally, this paper reviews the results of the first running of a Hub-4 IE-NE Evaluation. The Baseline IE-NE evaluation, in which BBN's IdentiFinder was run on the primary system transcripts submitted for the Hub-4 Broadcast News evaluation, found that the transcripts generated by LIMSI's automatic speech recognition system produced the "highest" F-measure score (82.39). In the Quasi IE-NE evaluation, where sites ran their own NE taggers on a set of three baseline recognizer transcripts, the SRI-developed tagger achieved the highest F-measure score for baseline recognizers 1 & 3, while the BBN develop...
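The F-measure scores quoted above combine precision and recall over named-entity output. A minimal worked computation from counts of correct, spurious, and missed entities (the counts in the usage comment are invented):

```python
# Minimal computation of the precision / recall / F-measure used to score
# named-entity output, from counts of correct, spurious, and missed entities.
def ne_f_measure(correct, spurious, missed, beta=1.0):
    precision = correct / max(correct + spurious, 1)
    recall = correct / max(correct + missed, 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# ne_f_measure(824, 90, 86) -> precision ~0.90, recall ~0.91, F ~0.90
# (illustrative counts, not results from the evaluation above)
```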
Robinson, T., Hochberg, M. and Renals, S. The use of recurrent networks in continuous speech recognition. In C. H. Lee, K. K. Paliwal and F. K. Soong (Eds.), Automatic speech and speaker recognition – advanced topics, 233-258, Kluwer Academic Publishers, Boston, 1996.