Conference Paper

Web-assisted annotation, semantic indexing and search of television and radio news

Authors:
  • Mike Dowman
  • Valentin Tablan
  • Hamish Cunningham
  • Borislav Popov

Abstract

The Rich News system, which can automatically annotate radio and television news with the aid of resources retrieved from the World Wide Web, is described. Automatic speech recognition gives a temporally precise but conceptually inaccurate annotation model. Information extraction from related web news sites gives the opposite: conceptual accuracy but no temporal data. Our approach combines the two for temporally accurate conceptual semantic annotation of broadcast news. First, low-quality transcripts of the broadcasts are produced using speech recognition, and these are then automatically divided into sections corresponding to individual news stories. A key phrase extraction component finds key phrases for each story and uses these to search for web pages reporting the same event. The text and meta-data of the web pages are then used to create index documents for the stories in the original broadcasts, which are semantically annotated using the KIM knowledge management platform. A web interface then allows conceptual search and browsing of news stories, and playing of the parts of the media files corresponding to each news story. The use of material from the World Wide Web allows much higher quality textual descriptions and semantic annotations to be produced than would have been possible using the ASR transcript directly. The semantic annotations can form a part of the Semantic Web, and an evaluation shows that the system operates with high precision and a moderate level of recall.
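A toy sketch of the pipeline described in the abstract, with simplified stand-ins for each stage (ASR transcript, story segmentation, key phrase extraction, web lookup, index document). All names here are hypothetical illustrations, not the actual Rich News components:

```python
"""Toy sketch (not the authors' code) of the Rich News pipeline stages:
ASR transcript -> story segmentation -> key phrases -> web lookup -> index document."""

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class IndexDocument:
    start: float
    end: float
    key_phrases: list
    web_url: str = ""
    annotations: list = field(default_factory=list)  # would come from KIM in the real system

def segment_into_stories(transcript_words, window=50):
    """Naive stand-in for topical segmentation (Rich News uses Choi's C99)."""
    return [transcript_words[i:i + window] for i in range(0, len(transcript_words), window)]

def extract_key_phrases(words, n=3):
    """Frequency-based stand-in for the TF.IDF key phrase extractor."""
    counts = Counter(w.lower() for w in words if len(w) > 4)
    return [w for w, _ in counts.most_common(n)]

def search_news_sites(phrases):
    """Placeholder: a real system would query news websites with the phrases."""
    return "https://example.org/news?q=" + "+".join(phrases)

def annotate_broadcast(transcript_words, words_per_second=2.5, window=50):
    docs = []
    for i, story in enumerate(segment_into_stories(transcript_words, window)):
        phrases = extract_key_phrases(story)
        start_word = i * window
        docs.append(IndexDocument(
            start=start_word / words_per_second,              # rough time offsets
            end=(start_word + len(story)) / words_per_second,
            key_phrases=phrases,
            web_url=search_news_sites(phrases),
        ))
    return docs
```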


... News users: More than half the main papers aim to offer news services to the general public. An early example is Rich News [11], a system that automatically transcribes and segments radio and TV streams. Key phrases extracted from each segment are used to retrieve web pages that report the same news event. ...
... news articles (59), news feeds (23), RSS feeds (17), KG (11), social media (10), multimedia (8), Twitter (6), TV news (4), user histories (4), news metadata (3) Life-cycle phase: ...
... • Semantic exchange formats: RDF (43), OWL (28), SPARQL (25), KG (18), RDFS (12)
• Semantic ontologies and vocabularies: FOAF (6), ESO (2), GAF (2)
• Semantic information resources: domain ontology (31), DBpedia (23), LOD (14), Freebase (9), Wikidata (9), GeoNames (7), Google KG (3), YAGO (3), OpenCyc (2), ConceptNet (2)
• Semantic processing techniques: entity linking (32), Jena (12), reasoning (7), inference (6), DBpedia Spotlight (5), OpenCalais (4), description logic (4), PropBank (2), FrameNet (2)
• Other processing techniques (language): entity extraction (36), NL pre-processing (33), coreference resolution (11), GATE (10), Lucene (7), spaCy (7), JAPE (6), morphological analysis (6)
Table 6 shows the conceptual framework that results from populating our analysis framework in Table 1 with the most frequently used sub-themes from the analysis. It is organised in a hierarchy of depth up to 4 (e.g., Other techniques → Other resources → language → WordNet). ...
Article
Full-text available
ICT platforms for news production, distribution, and consumption must exploit the ever-growing availability of digital data. These data originate from different sources and in different formats; they arrive at different velocities and in different volumes. Semantic knowledge graphs (KGs) are an established technique for integrating such heterogeneous information. The technique is therefore well-aligned with the needs of news producers and distributors, and it is likely to become increasingly important for the news industry. This paper reviews the research on using semantic knowledge graphs for production, distribution, and consumption of news. The purpose is to present an overview of the field; to investigate what it means; and to suggest opportunities and needs for further research and development.
... Due to this diverse structure and content the challenge here is how to choose and customise the ontology learning methods, so that they can achieve the best possible results with minimum human intervention. Another aspect that is worth considering here is whether some knowledge is easier to acquire from only some of these sources (e.g., key terms from the source code comments), and then combine this newly acquired knowledge with information from the other sources (for an application of this approach in multimedia indexing see [11]). Static vs dynamic: As software tends to go through versions or releases, i.e., evolve over time, the majority of software-related datasources tend to change over time, albeit some more frequently than others. ...
... Our multi-source ontology learning system uses the language processing facilities provided by GATE itself [8, 3, 11] and we have modified or extended some of them specifically for the problem of learning from software artifacts. Note that GATE plays a dual role in our research – both as one of the software projects used for experimenting with our technology and also as the language processing software infrastructure, which we used for building the technology itself. ...
... The goal is to derive the same term from the singular and plural forms, instead of two different terms. The third component is the GATE key phrase extractor [11], which is based on TF.IDF (term frequency/inverse document frequency). This method looks for phrases that occur more frequently in the text under consideration than they do in language as a whole. ...
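A minimal TF.IDF scoring sketch, not the GATE implementation, illustrating why phrases frequent in the story but rare in a background collection surface as key phrases:

```python
import math
from collections import Counter

def tfidf_scores(story_tokens, background_docs):
    """Score each term by its frequency in the story times its inverse document
    frequency over a background collection; high scores suggest key phrases."""
    tf = Counter(story_tokens)
    n_docs = len(background_docs)
    scores = {}
    for term, freq in tf.items():
        df = sum(1 for doc in background_docs if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1   # smoothed IDF
        scores[term] = (freq / len(story_tokens)) * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: the top-scoring terms would then be used as web search queries.
story = "the minister announced new rail funding for the north".split()
background = [set("the weather will be sunny".split()),
              set("the new film opens this week".split())]
print(tfidf_scores(story, background)[:3])
```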
Article
Full-text available
While early efforts on applying Semantic Web technologies to solve software engineering related problems show promising results, the very basic process of augmenting software artifacts with their semantic representations is still an open issue. Indeed, existing techniques to learn ontologies that describe the domain of a certain software project either 1) explore only one information source associated with this project or 2) employ supervised and domain-specific techniques. In this paper we present an ontology learning approach that 1) exploits a range of information sources associated with software projects and 2) relies on techniques that are portable across application domains.
... Likewise, news comes from newspaper publishers in either soft copy or hard copy. With the web, soft-copy news spreads easily to every corner of society; distributed through online media, it is quickly absorbed and subsequently affects social life [15]. The web allows the presentation of information that can be read, heard, and seen in parallel. ...
... In line with information needs, studies on the web have progressed to the point that the web is becoming a collection of smart documents, namely "documents that know about themselves" [18]. The web has become the core of the internet; therefore, studies continue to be done to improve the capability of the web as a place for information, with web facilities coupled with the implementation of the standards of the W3C (World Wide Web Consortium), an organization that assesses and standardizes the attributes for structuring information [15]. ...
Article
Full-text available
In the social world there are many issues, positive or negative. The negative issues affect the level of social comfort. On social media such as the Web, every issue is positioned based on a document, which has its own attributes, such as its URL address and date of creation. It is not easy to extract information from the Web, or to determine the origin of an issue that is flowing through the web. This paper derives a method for revealing the origin of an issue based on the characteristics of each webpage.
... The THISL system [1] applies an automated speech recognition system (ABBOT) on BBC news broadcasts and uses a bag-of-words model on the resulting transcripts for programme retrieval. The Rich News system [5] also uses ABBOT for speech recognition. It then segments the transcripts with Choi's C99 algorithm [3], which uses bag-of-words similarity between consecutive segments. ...
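A simplified sketch of lexical-cohesion segmentation in the spirit of this approach (not an implementation of C99): candidate boundaries are placed where the cosine similarity between adjacent bag-of-words windows drops below a threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def boundary_candidates(tokens, window=40, threshold=0.1):
    """Compare bag-of-words windows on either side of each candidate point and
    report points where lexical cohesion falls below the threshold."""
    boundaries = []
    for i in range(window, len(tokens) - window, window):
        left = Counter(tokens[i - window:i])
        right = Counter(tokens[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries

# Two artificial "stories": the boundary is detected where the vocabulary shifts.
tokens = ("rail funding north announcement " * 30 + "football cup final result " * 30).split()
print(boundary_candidates(tokens))   # -> [120]
```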
... The further away a common ancestor between two categories is, the lower the cosine similarity between those two categories will be. We implemented such a vector space model within our RDFSim project. We consider a vector in that space for each DBpedia web identifier, corresponding to a weighted sum of all the categories attached to it. ...
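A toy version of the category vector space described above (the category sets and weights are invented for illustration; this is not the RDFSim code):

```python
import math

def category_cosine(a, b):
    """Cosine similarity between two weighted bags of DBpedia categories."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical weighted category vectors for two DBpedia identifiers; in the
# described model, weights could decay with distance up the category hierarchy.
vec_bbc = {"Category:Television_in_the_UK": 1.0, "Category:Broadcasting": 0.5}
vec_itv = {"Category:Television_in_the_UK": 1.0, "Category:Companies": 0.5}
print(category_cosine(vec_bbc, vec_itv))   # shared ancestor categories raise the similarity
```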
Article
Full-text available
The BBC is currently tagging programmes manually, using DBpedia as a source of tag identifiers, and a list of suggested tags extracted from the programme synopsis. These tags are then used to help navigation and topic-based search of programmes on the BBC website. However, given the very large number of programmes available in the archive, most of them having very little metadata attached to them, we need a way to automatically assign tags to programmes. We describe a framework to do so, using speech recognition, text processing and concept tagging techniques. We describe how this framework was successfully applied to a very large BBC radio archive. We demonstrate an application using automatically extracted tags to aid discovery of archive content.
... GAMPs falling in this class are language-specific, so two SA_GAMPs have been designed, for Italian and English information extraction respectively. In the following, the Italian SA_GAMP will be used as the reference example for the discussion, while technical details of the English one are found in [17]. ...
... Finally, a module using an ontology to annotate the news item is applied (in PrestoSpace, the KIM platform [13] has been used). More details on the Italian SA_GAMP can be found in [11], while the English SA_GAMP is discussed in detail in [17]. KIM will be further discussed in section 3.2. ...
Conference Paper
Full-text available
This paper will present the contribution of the European PrestoSpace project to the study and development of a Metadata Access and Delivery (MAD) platform for multimedia and television broadcast archives. The MAD system aims at generating, validating and delivering to archive users metadata created by automatic and semi-automatic information extraction processes. The MAD publication platform employs audiovisual content analysis, speech recognition (ASR) and semantic analysis tools. It then provides intelligent facilities to access the imported and newly produced metadata. The possibilities opened by the PrestoSpace framework to intelligent indexing and retrieval of multimedia objects within large scale archives apply as well to more general scenarios where semantic information is needed to cope with the complexity of the search process.
... In [MBS08] Messina et al. describe tools for content-based analysis, e.g., scene-cut detection, speech-to-text transcription and keyframe extraction. Moreover, named entities and categories from the BBC's program web-pages were also extracted and mapped to the PROTON upper ontology [DTCP05]. Mediaglobe complements PrestoSpace's efforts, as high-level abstraction-layer analysis technologies, i.e. visual concept detection, were not in the scope of PrestoSpace. ...
... Therefore, background knowledge has to be aggregated that is maintained independently of the original source, i.e., the film archive. In contrast to [DTCP05, MBD06] we access DBpedia as a multilingual knowledge base to allow cross-lingual search and annotation with entities that are steadily curated by the Wikipedia community. Europeana aggregates video data, including supporting documents, from various film archives in Europe. ...
... Rich News is essentially an application-independent annotation system, there being many potential uses for the semantic annotations it produces. The first use proposed for the Rich News system was to automate the annotation of BBC news programmes [9]. For more than twenty years the BBC have been semantically annotating their news in terms of a taxonomy called Lonclass, which was derived from the Universal Decimal Classification system commonly used by libraries to classify books [3]. ...
... If this page matches sufficiently closely, it is associated with the story, and used to derive title, summary and section annotations. An evaluation of an earlier version of Rich News [9] found that 92.6% of the web pages found in this way reported the same story as that in the broadcast, while the remaining ones reported closely related stories. That version of Rich News searched only the BBC website, and was successful in finding web pages for 40% of the stories, but the addition of multiple news sources, and an improved document matching component, can be expected to have raised both the precision and recall of the system, though no formal evaluation of the current system has been conducted. ...
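A minimal sketch of the document-matching step (hypothetical scoring, not the Rich News matcher): a candidate web page is accepted when enough of the story's key phrases occur in it, and a matching page then supplies the title, summary and section annotations.

```python
def page_matches_story(story_phrases, page_text, min_overlap=0.5):
    """Hypothetical closeness test: accept a candidate web page if a large enough
    fraction of the story's key phrases appear in the page text."""
    page = page_text.lower()
    hits = sum(1 for phrase in story_phrases if phrase.lower() in page)
    return hits / len(story_phrases) >= min_overlap if story_phrases else False

print(page_matches_story(["rail funding", "transport minister"],
                         "Transport minister announces rail funding boost"))  # True
```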
Article
The Rich News system for semantically annotating television news broadcasts and augmenting them with additional web content is described. Online news sources were mined for material reporting the same stories as those found in television broadcasts, and the text of these pages was semantically annotated using the KIM knowledge management platform. This resulted in more effective indexing than would have been possible if the programme transcript was indexed directly, owing to the poor quality of transcripts produced using automatic speech recognition. In addition, the associations produced between web pages and television broadcasts enable the automatic creation of augmented interactive television broadcasts and multimedia websites.
... In addition to enabling catching-up users to see live-tweets and speculations posted by other users who watched the episode in sync with the broadcast, we can design and develop interactions that only catching-up users can appreciate. Inspired by the recent proposals for story-based retrievals from TV shows [57,71] and comics [37,42] or annotations [16,49], we suggest that it may be helpful to guide catching-up users to be attentive by showing them a subtle alert during scenes about which others have frequently speculated. This can be identified through an analysis of speculation-tweets. ...
Article
Full-text available
A growing number of people are using catch-up TV services rather than watching simultaneously with other audience members at the time of broadcast. However, computational support for such catching-up users has not been well explored. In particular, we are observing an emerging phenomenon in online media consumption experiences in which speculation plays a vital role. As the phenomenon of speculation implicitly assumes simultaneity in media consumption, there is a gap for catching-up users, who cannot directly appreciate the consumption experiences. This conversely suggests that there is potential for computational support to enhance the consumption experiences of catching-up users. Accordingly, we conducted a series of studies to pave the way for developing computational support for catching-up users. First, we conducted semi-structured interviews to understand how people are engaging with speculation during media consumption. As a result, we discovered the distinctive aspects of speculation-based consumption experiences in contrast to social viewing experiences sharing immediate reactions that have been discussed in previous studies. We then designed two prototypes for supporting catching-up users based on our quantitative analysis of Twitter data in regard to reaction- and speculation-based media consumption. Lastly, we evaluated the prototypes in a user experiment and, based on its results, discussed ways to empower catching-up users with computational supports in response to recent transformations in media consumption.
... This system also utilizes an ASR tool to obtain the video texts and IE techniques (named entity recognition). Another semantic video annotation application called Rich News has been described in [8], where the authors make use of the resources on the web to enhance the indexing process. The overall system contains the following modules: automatic speech recognition, key-phrase extraction from the speech transcripts and searching the video using key phrases. ...
... Several attempts have been made, such as the THISL system [8], which used ABBOT for automatic speech recognition on BBC news broadcasts, and the Rich News system [9], which also used ABBOT for speech recognition. The transcripts were then segmented with Choi's C99 algorithm [10], using bag-of-words matching between consecutive segments. ...
Conference Paper
Many organizations have attempted to automatically exploit the data embedded in web pages and enrich the web with a semantic dimension. Data should follow the principles outlined by Tim Berners-Lee, which are based on traditional web technologies, such as the Uniform Resource Identifier (URI) and the Hypertext Transfer Protocol (HTTP), and semantic web technologies, including knowledge representation languages such as the Resource Description Framework (RDF), as well as links to other data. The uses of linked data technology are numerous and varied. The case of the British Broadcasting Corporation (BBC) is the most widely reported success story of linked data technology usage in the literature. This success stems from the use of linked data in the BBC web portal, which enables the site to present rich content that is automatically updated from the linked data cloud. The aim of this study is to analyze the literature relating to this case study to derive approaches and technologies for linked data usage and propose a group of best practices, as well as a generic approach that can be used by web developers.
... In Section 5 we present two methods for acquiring annotations, obtained from two main sources of annotators: linguists, and non-experts. For the expert linguists, we have developed a wiki-like platform from scratch, because existing annotation systems (e.g., GATE [26], NITE [18], or UIMA [31]) do not offer the functionalities required for deep semantic annotation. For the non-experts, we introduce a crowd-sourcing method based on gamification. ...
Chapter
Full-text available
The goal of the Groningen Meaning Bank (GMB) is to obtain a large corpus of English texts annotated with formal meaning representations. Since manually annotating a comprehensive corpus with deep semantic representations is a hard and time-consuming task, we employ a sophisticated bootstrapping approach. This method employs existing language technology tools (for segmentation, part-of-speech tagging, named entity tagging, animacy labelling, syntactic parsing, and semantic processing) to get a reasonable approximation of the target annotations as a starting point. The machine-generated annotations are then refined by information obtained from both expert linguists (using a wiki-like platform) and crowd-sourcing methods (in the form of a 'Game with a Purpose') which help us in deciding how to resolve syntactic and semantic ambiguities. The result is a semantic resource that integrates various linguistic phenomena, including predicate-argument structure, scope, tense, thematic roles, rhetorical relations and presuppositions. The semantic formalism that brings all levels of annotation together in one meaning representation is Discourse Representation Theory, which supports meaning representations that can be translated to first-order logic. In contrast to ordinary treebanks, the units of annotation in the GMB are texts, rather than isolated sentences. The current version of the GMB contains more than 10,000 public domain texts aligned with Discourse Representation Structures, and is freely available for research purposes.
... Named entity recognition (NER) is a task in information extraction (IE) which consists of identifying and classifying just some types of information elements, called named entities (NE) (Marrero et al. 2013). It is employed as the basis for many other important areas in information management, such as information retrieval (Mihalcea and Moldovan 2001), automatic summarization (Lee et al. 2003) and semantic multimedia annotation (Dowman et al. 2005; Saggion et al. 2004) in some domains such as biomedical texts (Tsai et al. 2006), business information documents (Sung and Chang 2004), and financial documents (Seng and Lai 2010). ...
Article
Full-text available
Named entity recognition (NER) is an information extraction subtask that attempts to recognize and categorize named entities in unstructured text into predefined categories such as the names of people, organizations, and locations. Recently, machine learning approaches, such as the hidden Markov model (HMM), as well as hybrid methods, are frequently used for named entity recognition. To the best of our knowledge, there are no publicly available data sets for NER in Persian, nor any machine learning-based Persian NER system. Because of HMM's innate weaknesses, in this paper we have used both a hidden Markov model and a rule-based method to recognize named entities in Persian texts. The combination of the rule-based method and the machine learning method results in highly accurate recognition. The proposed system uses the HMM and Viterbi algorithms in its machine learning section, and in its rule-based section employs a set of lexical resources and pattern bases for the recognition of named entities, including the names of people, locations and organizations. During this study, we annotated our own training and testing data sets for use in the related phases. Our hybrid approach performs on the Persian language with 89.73% precision, 82.44% recall, and 85.93% F-measure using an annotated test corpus including 32,606 tokens.
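A minimal illustration of the rule-based half of such a hybrid NER system (gazetteer lookups plus surface patterns); the gazetteer entries, the pattern, and the English example text are toy assumptions chosen for readability, and the statistical HMM half is only indicated by a comment:

```python
import re

# Toy hybrid idea: gazetteer and pattern rules handle clear-cut entities, while
# a statistical tagger (e.g. an HMM with Viterbi decoding) would label the rest (not shown).
GAZETTEER = {"tehran": "LOCATION", "isfahan": "LOCATION"}
PERSON_PATTERN = re.compile(r"\b(?:Mr|Dr|Mrs)\.?\s+([A-Z][a-z]+)")

def rule_based_entities(text):
    entities = [(m.group(1), "PERSON") for m in PERSON_PATTERN.finditer(text)]
    for token in re.findall(r"\w+", text):
        label = GAZETTEER.get(token.lower())
        if label:
            entities.append((token, label))
    return entities

print(rule_based_entities("Dr. Karimi travelled from Tehran to Isfahan."))
```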
... This was traditionally accomplished through the use of metadata, but has been replaced with semantic annotations based on domain ontologies (Berners-Lee et al., 2001). The advantages of such annotations are quicker search and retrieval of documents, the automation of several web-based activities, etc. (Gardenfors, 2004; Frienland et al., 2004; Dowman et al., 2005; Rinaldi et al., 2004; Plessers et al., 2005; Maynard et al., 2004; Hunter et al., 2004). Different methods have been employed to annotate knowledge assets; these are comprehensively tackled in Uren et al. (2005). ...
Article
Full-text available
Insider attack and espionage on computer-based information is a major problem for business organizations and governments. Knowledge Management Systems (KMSs) are not exempt from this threat. Prior research presented the Congenial Access Control Model (CAC), a relationship-based access control model, as a better access control method for KMS because it reduces the adverse effect of stringent security measures on the usability of KMSs. However, the CAC model, like other models, e.g., Role Based Access Control (RBAC), Time-Based Access Control (TBAC), and History Based Access Control (HBAC), does not provide adequate protection against privilege abuse by authorized users that can lead to industrial espionage. In this paper, the authors provide an Espionage Prevention Model (EP) that uses Semantic web-based annotations on knowledge assets to store relevant information and compares it to the Friend-Of-A-Friend (FOAF) data of the potential recipient of the resource. It can serve as an additional layer to previous access control models, preferably the Congenial Access Control (CAC) model.
... Named Entity Recognition (NER) is a task in Information Extraction (IE) consisting of identifying and classifying just some types of information elements, called Named Entities (NE) [1]. It is employed as the basis for many other important areas in Information Management, such as information retrieval [2], automatic summarization [3] and semantic multimedia annotation [4, 5] in some domains such as biomedical texts [6], business information documents [7], and financial documents [8]. Named entity recognition is one of the main and important information extraction subtasks and is defined as the recognition of names of people, locations, organizations as well as temporal and numeric expressions [9]. ...
Conference Paper
Full-text available
Named Entity Recognition (NER) is an information extraction subtask that attempts to recognize and categorize named entities in unstructured text into predefined categories such as the names of people, organizations, and locations. Recently, machine learning approaches, such as the Hidden Markov Model (HMM), as well as hybrid methods, are frequently used for named entity recognition. To the best of our knowledge, there are no publicly available data sets for NER in Persian, nor any machine learning-based Persian NER system. Because of HMM's innate weaknesses, in this paper we have used both a Hidden Markov Model and a rule-based method to recognize named entities in Persian texts. The combination of the rule-based method and the machine learning method results in highly accurate recognition. The proposed system, in its machine learning section, uses the HMM and Viterbi algorithms, and in its rule-based section employs a set of lexical resources and pattern bases for the recognition of named entities including the names of people, locations and organizations. During this study, we annotated our own training and testing data sets for use in the related phases. Our hybrid approach performs on the Persian language with 89.73% precision, 82.44% recall, and 85.93% F-measure using an annotated test corpus including 32,606 tokens.
... The use of knowledge embodied in annotation is being investigated in domains as diverse as scientific knowledge [10], radio and television news [11], genomics [12], making web pages accessible to visually impaired people [13] and the description of cultural artifacts in museums [14]. ...
Conference Paper
Full-text available
Every day thousands of news articles are published in Bangla from several different sources on the web, and this number is increasing rapidly. On the contrary, readers are often selective, reading only their desired news. In this connection, classical Information Extraction (IE) techniques are used to query with keywords over unstructured or semi-structured news content, fulfilling only partial requirements. However, they cannot interpret sequences of events or relations among entities, or infer unveiled facts to facilitate further human analysis. To achieve this goal, semantic technology adds formal structure and semantics to the news stream. In this paper, we propose a system that analyzes Bangla news content and automatically annotates things, people and places with semantic technology, extracting what happened, when, where, and who was involved, with the help of classical Natural Language Processing (NLP) techniques. Furthermore, we relate the news of today with previous news to accumulate information over time. We present our proposed system for semantically annotating Bangla news, experiment with SPARQL to infer integrated news from different sources over time, and show its effectiveness in querying specific information.
... et al. [13,27] evaluate the precision and recall of annotation types (elements in our second-level grammar) rather than actual results of semantic search. In subsequent research, a search evaluation on television and radio news articles was conducted in [28] using KIM, based on ontology and keyword-based query interpretation, not the rules developed in the system. ...
Article
While contemporary semantic search systems offer to improve classical keyword-based search, they are not always adequate for complex domain specific information needs. The domain of prescription drug abuse, for example, requires knowledge of both ontological concepts and “intelligible constructs” not typically modeled in ontologies. These intelligible constructs convey essential information that includes notions of intensity, frequency, interval, dosage and sentiments, which could be important to the holistic needs of the information seeker. In this paper, we present a hybrid approach to domain specific information retrieval (or knowledge-aware search system) that integrates ontology-driven query interpretation with synonym-based query expansion and domain specific rules, to facilitate search in social media. Our framework is based on a context-free grammar (CFG) that defines the query language of constructs interpretable by the search system. The grammar provides two levels of semantic interpretation: 1) a top-level CFG that facilitates retrieval of diverse textual patterns, which belong to broad templates and 2) a low-level CFG that enables interpretation of certain specific expressions that belong to such patterns. These low-level expressions occur as concepts from four different categories of data: 1) ontological concepts, 2) concepts in lexicons (such as emotions and sentiments), 3) concepts in lexicons with only partial ontology representation, called lexico-ontology concepts (such as side effects and routes of administration (ROA)), and 4) domain specific expressions (such as date, time, interval, frequency and dosage) derived solely through rules. Our approach is embodied in a novel Semantic Web platform called PREDOSE, which provides search support for complex domain specific information needs in prescription drug abuse epidemiology. When applied to a corpus of over 1 million drug abuse-related web forum posts, our search framework proved effective in retrieving relevant documents when compared with three existing search systems.
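As an illustration of a two-level query grammar of the kind described (not the actual PREDOSE grammar; the non-terminals and terminal words below are invented for the example), a toy context-free grammar can be defined and parsed with NLTK:

```python
import nltk

# Toy query grammar: a top-level QUERY template composed of low-level
# categories (drug term, event, dosage expression). Illustrative only.
grammar = nltk.CFG.fromstring("""
  QUERY  -> DRUG EVENT | DRUG EVENT DOSAGE
  DRUG   -> 'loperamide' | 'buprenorphine'
  EVENT  -> 'withdrawal' | 'overdose'
  DOSAGE -> NUM UNIT
  NUM    -> '2' | '10'
  UNIT   -> 'mg'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['buprenorphine', 'withdrawal', '2', 'mg']):
    print(tree)   # the parse tree makes the query's structure explicit
```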
... The famous TF-IDF concept is used for selecting keywords [10]. This method looks for words that occur frequently in the text under consideration, but discounts words that occur frequently in the text collection as a whole. ...
Article
Recently, digital systems such as networked production systems and video archive systems have been under construction inside broadcasting stations for efficient multimedia retrieval and management. While the importance of metadata for multimedia retrieval services cannot be overemphasized, it is very difficult to generate semantic metadata (e.g. titles, keywords, characters' names, etc.) that is useful in the broadcasting field through pure audio-visual signal processing. The goal of our project is to develop a technology for generating semantic metadata from broadcast content using speech/text/face recognition methods for practical usage. Speech and text recognition engines optimised for news programmes have been developed and integrated into a data summarizing module together with the face recognition engine (introduced at IBC 2004). The recognition results extracted from these engines are merged and summarized based on word importance and the TF-IDF concept to generate semantic metadata for each news scene. To evaluate the performance of the developed engines, the Automatic Metadata Generator software, OMEGA, has been implemented. We have experimented with it on a commercial MAM system to show the usefulness of our approach, which is based on recognition and data summarization technology.
... – Obtain links to resources by using the services selected in the previous stage. To accomplish this objective it will be necessary to define a search strategy, such as the creation of automatic queries, searching lists or ontologies, consulting online encyclopedias, etc. The work of Janevski and Dimitrova [Janevski, 2002] and of Dowman et al. [Dowman, 2005] on video enrichment are interesting examples of how such information extraction can be achieved. ...
Article
Full-text available
Abstract: We present a project currently under development whose objective is the creation of a text-enrichment model based on the integration of resources available in the web space. The proposed model aims to transform linear plain texts into hypertexts that provide information and multimedia resources about recognized entities. With this application, users will be able to transform texts into self-explanatory hypertexts that give them a deeper understanding and save them from carrying out individual searches for related information. The evolution towards the web 2.0 concept and the proliferation and popularization of alternative search engines, blogs, wikis, tagging services, question/answering services, etc. are ideal for efficiently exploiting the resources that the Internet provides and using them strategically for text enrichment. Keywords: hypertext, entity recognition, text enrichment, content augmentation.
... While there have been attempts to apply semantic annotation tools to multimedia data (e.g. news videos [DTCP05]), the approaches tend to be domain and application-specific and thus need to be developed further prior to being applied to software artefacts, such as screen shots, training videos, and software specifications. ...
Article
EU-IST Strategic Targeted Research Project (STREP) IST-2004-026460 TAO, Deliverable D3.1 (WP3). This deliverable is concerned with developing algorithms and tools for semantic annotation of legacy software artefacts, with respect to a given domain ontology. In the case of non-textual content, e.g., screen shots and design diagrams, we have applied OCR software prior to Information Extraction. The results have been made available as a web service, which is in the process of being refined and integrated within the TAO Suite.
... Focusing on these problems, many works, e.g. [1][2][3][4][5][6][7][8][9], present search methods. All the suggested methods, for example similarity search, latent semantic indexing and conceptual word-chains, make considerable progress relative to the original keyword-matching mode. ...
Article
Full-text available
In this paper, an overall framework for a paper retrieval system based on papers' connotation is proposed. The paper database is sorted into four ranks. Each of them is mainly described by an extended keyword set, which serves as the carrier of the precise connotation of a paper. Based on the matching degree between the vocabulary of a paper's introduction and the extended keyword set of the topics, papers in those topics are selected by using fuzzy rules. Thus, papers whose connotation approximates the user's interest can be obtained. Furthermore, an automatic method to identify new and hot topics is presented.
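A rough sketch of the matching-degree idea under stated assumptions: the "extended keyword set" is represented here as a plain list of strings, and the fuzzy rules are approximated by two fixed thresholds.

```python
def matching_degree(intro_tokens, extended_keywords):
    """Toy matching degree: fraction of a topic's extended keyword set that
    appears in the paper introduction (a stand-in for the paper's fuzzy rules)."""
    intro = set(t.lower() for t in intro_tokens)
    keys = set(k.lower() for k in extended_keywords)
    return len(intro & keys) / len(keys) if keys else 0.0

def select_papers(papers, topic_keywords, low=0.3, high=0.6):
    """Threshold-based selection: clearly relevant above `high`, borderline
    between `low` and `high`, rejected below `low`."""
    relevant, borderline = [], []
    for title, intro in papers:
        degree = matching_degree(intro.split(), topic_keywords)
        if degree >= high:
            relevant.append((title, degree))
        elif degree >= low:
            borderline.append((title, degree))
    return relevant, borderline

papers = [("Paper A", "latent semantic indexing for paper retrieval"),
          ("Paper B", "a study of bird migration")]
print(select_papers(papers, ["semantic", "indexing", "retrieval"]))
```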
... We have developed the wiki-like platform from scratch simply because existing annotation systems, such as GATE (Dowman et al., 2005), NITE (Carletta et al., 2003), or UIMA (Hahn et al., 2007), do not offer the functionality required for deep semantic annotation combined with crowdsourcing. ...
Conference Paper
Full-text available
Data-driven approaches in computational semantics are not common because there are only few semantically annotated resources available. We are building a large corpus of public-domain English texts and annotate them semi-automatically with syntactic structures (derivations in Combinatory Categorial Grammar) and semantic representations (Discourse Representation Structures), including events, thematic roles, named entities, anaphora, scope, and rhetorical structure. We have created a wiki-like Web-based platform on which a crowd of expert annotators (i.e. linguists) can log in and adjust linguistic analyses in real time, at various levels of analysis, such as boundaries (tokens, sentences) and tags (part of speech, lexical categories). The demo will illustrate the different features of the platform, including navigation, visualization and editing.
... Moreover, in television, semantic annotation of programmes, for example news, could produce electronic programme guides, which would allow the user to view details of forthcoming programmes in terms of entities referred to in particular broadcasts [Dowman et al., 2005]. ...
Article
Full-text available
The paper presents an ontological approach for enabling a semantic-aware video retrieval framework that facilitates user access to desired content. Through the ontologies, the system will express key entities and relationships describing videos in a formal machine-processable representation. An ontology-based knowledge representation could be used for content analysis and concept recognition, for reasoning processes and for enabling user-friendly and intelligent multimedia content retrieval.
... A number of previous approaches have been taken to the problem of segmentation of text and speech transcripts. Some of these approaches have been based only on differences in the distribution of words in parts of the text dealing with different topics (Hearst, 1994; Choi, 2000; Kan et al., 1998; Dowman et al., 2005), while others have focused on features that are indicative of topic boundaries (Franz et al., 2003; Mulbregt et al., 1998; Kehagias et al., 2004; Maskey and Hirschberg, 2003). Generally, the greatest success has been achieved by combining both kinds of cues into a single system (Chaisorn et al., 2003; Beeferman et al., 1999; Galley et al., 2003). ...
Article
Full-text available
In order to determine the points at which meeting discourse changes from one topic to another, probabilistic models were used to approximate the process through which meeting transcripts were produced. Gibbs sampling was used to estimate parameter values in the models, including the locations of topic boundaries. The paper shows how discourse features were integrated into the Bayesian model, and reports empirical evaluations of the benefit obtained through the inclusion of each feature and of the suitability of alternative models of the placement of topic boundaries. It demonstrates how multiple cues to segmentation can be combined in a principled way, and empirical tests show a clear improvement over previous work.
... We have developed the wiki-like platform from scratch simply because existing annotation systems, such as GATE (Dowman et al., 2005), NITE (Carletta et al., 2003), or UIMA (Hahn et al., 2007), do not offer the functionality required for deep semantic annotation combined with crowdsourcing. ...
Poster
Full-text available
Named Entity Extraction is a mature task in the NLP field that has yielded numerous services gaining popularity in the Semantic Web community for extracting knowledge from web documents. These services are generally organized as pipelines, using dedicated APIs and different taxonomies for extracting, classifying and disambiguating named entities. Integrating one of these services in a particular application requires implementing an appropriate driver. Furthermore, the results of these services are not comparable due to different formats. This prevents the comparison of the performance of these services as well as their possible combination. We address this problem by proposing NERD, a framework which unifies 10 popular named entity extractors available on the web, and the NERD ontology, which provides a rich set of axioms aligning the taxonomies of these tools.
... Kostkova et al. [14] demonstrated the use of the Semantic Web to provide contextualized browsing in health portals' web pages. In an earlier study, semantic annotation was used to link news broadcasts with related online resources [15]. However, semantic annotation from medical reports' text to improve consumer understanding and access to informational resources is still largely unexplored. ...
Article
Full-text available
Patients often have difficulty in understanding medical concepts and vocabulary in their Discharge Summaries. We explore automatic hyper-linking to online resources for difficult terms as a means of making the content more comprehensible for patients. We use the Consumer Health Vocabulary (CHV) as a resource for scoring the difficulty of terms and to provide the most consumer-friendly synonyms. We implement a term extraction component providing semantic annotation using the KIM Knowledge and Information Management Platform. We hyperlink these terms to pages indexed by MedLinePlus to provide consumer-friendly online explanations. A web interface allows for viewing annotated Discharge Summaries and browsing search results. In a preliminary evaluation, the system was used to annotate eight Clinical Management sections of Discharge Summaries. The automatic hyper-linking provides good precision in linking to topically-relevant pages indexed by MedLinePlus. Our approach shows promise as a technology to deploy in future portals where consumers view their Discharge Summaries online.
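A hedged sketch of the term-scoring and hyper-linking step: the CHV familiarity scores, the synonym table, and the link URL below are placeholders invented for illustration, not the real vocabulary or the MedlinePlus index used by the authors.

```python
# Hypothetical data: low scores mean a term is unfamiliar to consumers.
CHV_SCORES = {"myocardial infarction": 0.2, "heart attack": 0.9, "aspirin": 0.8}
CONSUMER_SYNONYMS = {"myocardial infarction": "heart attack"}

def link_difficult_terms(terms, threshold=0.5):
    """Attach a consumer-friendly link to each term whose familiarity score
    falls below the threshold, using its consumer synonym when available."""
    links = {}
    for term in terms:
        if CHV_SCORES.get(term, 0.0) < threshold:
            friendly = CONSUMER_SYNONYMS.get(term, term)
            # Placeholder URL; a real system would point at an indexed consumer resource.
            links[term] = "https://example.org/consumer-health?query=" + friendly.replace(" ", "+")
    return links

print(link_difficult_terms(["myocardial infarction", "aspirin"]))
```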
... Nakov and Hearst [30] have shown the power of using the Web as training data for natural language analysis. Web-assistance for extracting keywords for the purposes of content indexing and annotation is studied in [12, 37, 26]. This work is focused on automated, Web-based tools for understanding the meaning of the text as written, as opposed to the inferences that can be drawn based on the text. ...
Article
Newly published data, when combined with existing public knowledge, allows for complex and sometimes unintended inferences. We propose semi-automated tools for detecting these inferences prior to releasing data. Our tools give data owners a fuller understanding of the implications of releasing data and help them adjust the amount of data they release to avoid unwanted inferences. Our tools first extract salient keywords from the private data intended for release. Then, they issue search queries for documents that match subsets of these keywords, within a reference corpus (such as the public Web) that encapsulates as much relevant public knowledge as possible. Finally, our tools parse the documents returned by the search queries for keywords not present in the original private data. These additional keywords allow us to automatically estimate the likelihood of certain inferences. Potentially dangerous inferences are flagged for manual review. We call this new technology Web-based inference control. The paper reports on two experiments which demonstrate early successes of this technology. The first experiment shows the use of our tools to automatically estimate the risk that an anonymous document allows for re-identification of its author. The second experiment shows the use of our tools to detect the risk that a document is linked to a sensitive topic. These experiments, while simple, capture the full complexity of inference detection and illustrate the power of our approach.
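A simplified sketch of the Web-based inference control loop described above, assuming a caller-supplied `search` function in place of a real search engine API:

```python
from itertools import combinations

def inference_candidates(extracted_keywords, search, max_subset=2, top_docs=10):
    """Query subsets of the document's salient keywords against a reference
    corpus and count co-occurring terms that were NOT in the original document.
    `search` is assumed to map a query string to a list of result texts."""
    original = set(extracted_keywords)
    flagged = {}
    for size in range(1, max_subset + 1):
        for subset in combinations(sorted(original), size):
            for doc in search(" ".join(subset))[:top_docs]:
                for term in set(doc.lower().split()) - original:
                    flagged[term] = flagged.get(term, 0) + 1
    # Terms that repeatedly co-occur with the released keywords hint at
    # inferences a reader could draw; high-count terms go to manual review.
    return sorted(flagged.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with a stubbed reference corpus instead of a live search engine.
corpus = ["acme layoffs rumored after merger", "acme merger with globex announced"]
fake_search = lambda q: [d for d in corpus if all(w in d for w in q.split())]
print(inference_candidates(["acme", "merger"], fake_search)[:3])
```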
... On the other hand, semantic annotation of videos also serves a similar purpose for indexing and retrieval, widely known in Content Based Image Retrieval (CBIR) [14]. The existing video annotation approaches are based either on the analysis of transliterations or transcripts of a video recording [21][22][23], or on the motion detected and extracted from a video recording [24,25]. The latter approach is usually ontology-based, where the ontology serves as the knowledge foundation for annotation. ...
Article
With the advent of various services and applications of the Semantic Web, semantic annotation has emerged as an important research area. The use of semantically annotated ontologies has been evident in numerous information processing and retrieval tasks. One such task is utilizing a semantically annotated ontology in product design, which can support many important applications that are critical to aiding various design-related tasks. However, ontology development in design engineering remains a time-consuming and tedious task that demands tremendous human effort. In the context of product family design, management of different product information that features efficient indexing, update, navigation, search and retrieval across product families is both desirable and challenging. This paper attempts to address this issue by proposing an information management and retrieval framework based on a semantically annotated product family ontology. In particular, we propose a document profile (DP) model to suggest semantic tags for annotation purposes. Using a case study of digital camera families, we illustrate how faceted search and retrieval of product information can be accomplished based on the semantically annotated camera family ontology. Lastly, we briefly discuss some further research and applications in design decision support, e.g. commonality and variety, based on the semantically annotated product family ontology.
... In education, by contrast, semantic annotations of video recordings of lectures distributed over the Internet can be used to augment the material by providing explanations, references or examples, so that students can efficiently access, find and review material in a personal manner [4], [5]. Moreover, in television, semantic annotation of programmes, for example news, could produce electronic programme guides, which would allow the user to view details of forthcoming programmes in terms of entities referred to in particular broadcasts [6]. In this paper, we propose how to introduce a semantic-based representation of the information embedded in video media, both through user interaction and by exploiting ontologies. ...
Article
Full-text available
In this paper we propose ontology-based video content annotation and recommendation tools. Our system is able to perform automatic shot detection and supports users during the annotation phase in a collaborative framework by providing suggestions on the basis of actual user needs as well as modifiable user behaviour and interests. Annotations are based on domain ontologies expressing hierarchical links between entities and guaranteeing interoperability of resources. Examples to verify the effectiveness of both the shot detection and the frame matching modules are analyzed.
... For example, Narayan et al. [17] developed a multi-modal mobile interface combining speech and text for accessing web information through a personalized dialog. Dowman et al. [5] dealt with the problem of indexing and searching radio and television news using both speech and text. In the current work, we explore the potential of the particular modality pair of image and text for searching computer-related articles. ...
Article
Many online articles contain useful know-how knowledge about GUI applications. Even though these articles tend to be richly illustrated by screenshots, no system has been designed to take advantage of these screenshots to visually search know-how articles effectively. In this paper, we present a novel system to index and search software know-how articles that leverages the visual correspondences between screenshots. To retrieve articles about an application, users can take a screenshot of the application to query the system and retrieve a list of articles containing a matching screenshot. Useful snippets such as captions, references, and nearby text are automatically extracted from the retrieved articles and shown alongside the thumbnails of the matching screenshots as excerpts for relevancy judgement. Retrieved articles are ranked by a comprehensive set of visual, textual, and site features, whose weights are learned by RankSVM. Our prototype system currently contains 150k articles that are classified into walkthrough, book, gallery, and general categories. We demonstrated the system's ability to retrieve matching screenshots for a wide variety of programs, across language boundaries, and provide subjectively more useful results than keyword-based web and image search engines.
... Several studies have been carried out on employing NER output in particular for semantic multimedia annotation:
• a multimedia indexing system for English, German and Dutch football videos (Saggion et al., 2004)
• a video annotation system for Italian news videos (Basili et al., 2005)
• an automatic annotation system for BBC radio and TV news (Dowman et al., 2005) ...
Conference Paper
Full-text available
Named entity recognition (NER) is one of the main information extraction tasks, and research on NER from Turkish texts is known to be rare. In this study, we present a rule-based NER system for Turkish which employs a set of lexical resources and pattern bases for the extraction of named entities including the names of people, locations, and organizations, together with time/date and money/percentage expressions. The domain of the system is news texts, and it does not utilize the important clues of capitalization and punctuation, since they may be missing in texts obtained from the Web or in the output of automatic speech recognition tools. The evaluation of the system is performed on news texts along with other genres encompassing child stories and historical texts, but, as expected in the case of manually engineered rule-based systems, it suffers from performance degradation on these latter genres of texts since they are distinct from the target domain of news texts. Furthermore, the system is evaluated on transcriptions of news videos, leading to satisfactory results, which is an important step towards the employment of NER during automatic semantic annotation of videos in Turkish. The current study is significant as the first rule-based approach to the NER task on Turkish texts, with its evaluation on diverse text types.
... However, the real world is not machine readable and therefore knowledge needs to be extracted from documents, audio and video recordings or data streams. The extraction of information from content [8,9] and the detection of senses [36] are major research fields, and the interested reader is referred to the related literature for details. However, it should be noted that these annotations are often done manually, e.g. by tagging the song "Satisfaction" with the type "Rock", as extraction technologies for audiovisual data still have many limitations. ...
Article
Full-text available
The medium is the message! And the message was literacy, media democracy and music charts. Mostly one single distinguishable medium such as TV, the Web, the radio, or books transmitted the message. Now in the age of ubiquitous and pervasive computing, where information flows through a plethora of distributed interlinked media—what is the message ambient media will tell us? What does semantic mean in this context? Which experiences will it open to us? What is content in the age of ambient media? Ambient media are embedded throughout the natural environment of the consumer—in his home, in his car, in restaurants, and on his mobile device. Predominant sample services are smart wallpapers in homes, location based services, RFID based entertainment services for children, or intelligent homes. The goal of this article is to define semantic ambient media and discuss the contributions to the Semantic Ambient Media Experience (SAME) workshop, which was held in conjunction with the ACM Multimedia conference in Vancouver in 2008. The results of the workshop can be found on: www.ambientmediaassociation.org.
... On the other hand, semantic annotation of videos also serves a similar purpose for indexing and retrieval, widely known in Content Based Image Retrieval (CBIR) (Chang & Liu, 1984). The existing video annotation approaches are based either on the analysis of transliterations or transcripts of a video recording (Dowman, Tablan, Cunningham, & Popov, 2005; Repp, Linckels, & Meinel, 2007, 2008), or on the motion detected and extracted from a video recording (Bertini, Bimbo, Cucchiara, & Prati, 2004; Bertini, Bimbo, & Torniai, 2006). The latter approach is usually ontology-based, where the ontology serves as the knowledge foundation for annotation. ...
... [23]), interpreting search engine queries (e.g. [11]), automatic indexing and annotation [7, 8] and problems in structural linguistics like discovering conventional expressions (e.g. [17]). ...
Conference Paper
Detecting inferences in documents is critical for ensuring privacy when sharing information. In this paper, we propose a refined and practical model of inference detection using a reference corpus. Our model is inspired by association rule mining: inferences are based on word co-occurrences. Using the model and taking the Web as the reference corpus, we can find inferences and measure their strength through web-mining algorithms that leverage search engines such as Google or Yahoo!. Our model also includes the important case of private corpora, to model inference detection in enterprise settings in which there is a large private document repository. We find inferences in private corpora by using analogues of our Web-mining algorithms, relying on an index for the corpus rather than a Web search engine. We present results from two experiments. The first experiment demonstrates the performance of our techniques in identifying all the keywords that allow for inference of a particular topic (e.g. "HIV") with confidence above a certain threshold. The second experiment uses the public Enron e-mail dataset. We postulate a sensitive topic and use the Enron corpus and the Web together to find inferences for the topic. These experiments demonstrate that our techniques are practical, and that our model of inference based on word co-occurrence is well-suited to efficient inference detection.
Chapter
In recent years, many works have been published in the video indexing and retrieval field. However, few methods have been designed for Arabic video. The aim of this paper is to present a new approach for Arabic news video indexing based on embedded text as the information source and on knowledge extraction techniques, in order to provide a conceptual description of video content. First, we apply low-level processing to detect and recognize the video texts. Then, we extract conceptual information, including names of persons, organizations, and locations, using local grammars implemented with the linguistic platform NooJ. Our proposed approach was tested on a large collection of Arabic TV news, and the experimental results were satisfactory.
Book
This book introduces core natural language processing (NLP) technologies to non-experts in an easily accessible way, as a series of building blocks that lead the user to understand key technologies, why they are required, and how to integrate them into Semantic Web applications. Natural language processing and Semantic Web technologies have different, but complementary roles in data management. Combining these two technologies enables structured and unstructured data to merge seamlessly. Semantic Web technologies aim to convert unstructured data to meaningful representations, which benefit enormously from the use of NLP technologies, thereby enabling applications such as connecting text to Linked Open Data, connecting texts to each other, semantic searching, information visualization, and modeling of user behavior in online networks. The first half of this book describes the basic NLP processing tools: tokenization, part-of-speech tagging, and morphological analysis, in addition to the main tools...
Article
We performed an exploratory case study to understand how subject indexing performed by television production staff using a semi-controlled vocabulary affects indexing quality. In the study we used triangulation, combining tag analysis and semi-structured interviews with production staff of the Norwegian Broadcasting Corporation. The main findings reveal incomplete indexing of TV programs and their parts, in addition to low indexing consistency and uneven indexing exhaustivity. The informants expressed low motivation and a high level of uncertainty regarding the task. Internal guidelines and high domain knowledge among the indexers do not form a sufficient basis for creating quality and consistency in the vocabulary. The challenges revealed in the terminological analysis, combined with low indexing knowledge and a lack of motivation, will create difficulties in the retrieval phase.
Conference Paper
The BBC has a very large archive of programmes, covering a wide range of topics. This archive holds a significant part of the BBC's institutional memory and is an important part of the cultural history of the United Kingdom and the rest of the world. These programmes, or parts of them, can help provide valuable context and background for current news events. However, the BBC's archive catalogue is not a complete record of everything that was ever broadcast. For example, it excludes the BBC World Service, which has been broadcasting since 1932. This makes the discovery of content within these parts of the archive very difficult. In this paper we describe a system based on Semantic Web technologies which helps us to quickly locate content related to current news events within those parts of the BBC's archive with little or no pre-existing metadata. This system is driven by automated interlinking of archive content with the Semantic Web, user validations of the resulting data, and topic extraction from live BBC News subtitles. The resulting interlinks between live news subtitles and the BBC's archive are used in a dynamic visualisation enabling users to quickly locate relevant content. This content can then be used by journalists and editors to provide historical context, background information and supporting content around current affairs.
Article
The British Broadcasting Corp. (BBC) manually tags recent programs on its website. Editors draw and assign these tags from open datasets made available within the Linked Data cloud, but this is a time-consuming process. Aside from recent programming, which is tagged, the BBC has a large radio archive that is untagged. Thus the possibility of automatically assigning tags to programs in a reasonable amount of time has been investigated. Tags enable a variety of use cases, such as dynamic building of topical aggregations, retrieval through topic-based search, or cross-domain navigation. Automatic tagging of archive content would ensure archive programs are as findable as recent programs. It would mean that topic-based collections of archive content can be easily built, for example, to find archive content that relates to current news events. This paper describes an infrastructure to process large program archives in a cost-effective and scalable manner using Amazon Web Services. An automated tagging algorithm using speech audio as an input is described. The paper also explains how this algorithm can be separated and distributed and how the workflow can be managed robustly, ensuring appropriate error handling, resource monitoring, and data management on a large scale. Finally, the results from processing the BBC World Service English-speaking audio archive are presented.
Article
Purpose – The purpose of this paper is to investigate the use of television (TV) content for scholarly purposes. It focuses on: profile of scholars using TV content; the structure of their need for TV content; the situations in which scholars need TV content; and their patterns of use of TV content in each research stage. Design/methodology/approach – Taylor’s four components of the information use environment has contributed to the development of a conceptual framework. The data from the use of TV content by 668 scholars were profiled using correspondence analysis and co-word analysis. Additionally, the data from 15 interviews and content from 240 journal articles were analysed. Findings – The authors determined that the environment of the scholarly use of TV content is unique in terms of the scholars’ academic domains, research topics, motivation, and patterns of use. Six academic domains were identified as having used TV content to a meaningful degree, and their knowledge structure was presented as a map depicting the scholars’ needs for TV content. Scholars are likely to use TV content when they deal with timely social and cultural topics, or human behaviour. The scholars also showed different patterns of use of TV content at each stage of research. Originality/value – In this study, TV content was newly examined from the perspective of an information source for scholarly purposes, and it was found to be a meaningful source in several domains. This result extends the knowledge of information sources in scholarly communication and information services.
Conference Paper
Document redaction is widely used to protect sensitive information in published documents. In a basic redaction system, sensitive and identifying terms are removed from the document. Web-based inference is an attack on redaction systems whereby the redacted document is linked with other publicly available documents to infer the removed parts. Web-based inference also provides an approach for detecting unwanted inferences and so constructing secure redaction systems. Previous works on web-based inference used general keyword extraction methods for document representation. We propose a systematic approach, based on information-theoretic concepts and measures, to rank the words in a document for the purpose of inference detection. We extend our results to the case of multiple sensitive words and propose a metric that takes into account the possible relationships among the sensitive words, resulting in an effective and efficient inference detection system. Through a number of experiments we show that our approach, when used for document redaction, substantially reduces the number of inferences that are left in a document. We describe our approach, present the experimental results, and outline future work.
Conference Paper
In this paper, we give an overview of recent BBC R&D work on automated affective and semantic annotations of BBC archive content, covering different types of use-cases and target audiences. In particular, after giving a brief overview of manual cataloguing practices at the BBC, we focus on mood classification, sound effect classification and automated semantic tagging. The resulting data is then used to provide new ways of finding or discovering BBC content. We describe two such interfaces, one driven by mood data and one driven by semantic tags.
Article
It is commonly acknowledged that ever-increasing video archives should be conveniently indexed with the conveyed semantic information to facilitate later video retrieval. Domain-independent semantic video indexing is usually carried out through manual means which is too time-consuming and labor-intensive to be employed in practical settings. On the other hand, fully automated approaches are usually proposed for very specialized domains such as team sports videos. In this paper, we propose a generic text-based semi-automatic system for off-line semantic indexing and retrieval of news videos, since video texts such as speech transcripts stand as a plausible source of semantic information. The proposed system has a pipelined flow of execution where the sole manual intervention takes place during text extraction, yet it could execute in fully automated mode in case the associated video text is already available or a convenient text extractor is available to be incorporated into the system. At the core of the system is an information extraction component – a named entity recognizer – which extracts representative semantic information from the video texts. Based on the proposed generic system, a novel semantic annotation and retrieval system for Turkish is designed, implemented, and evaluated on two distinct news video data sets. By equipping it with the necessary components, the ultimate system is also turned into a multilingual video retrieval system and executed on a video data set in English, thereby facilitating multilingual semantic video retrieval.
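As a rough illustration of text-based semantic indexing via named-entity recognition (not the authors' Turkish system), the sketch below builds a simple entity-to-segment index from transcript segments, using spaCy's English model as a stand-in recognizer.

```python
# Minimal sketch of text-based semantic indexing of news video transcripts
# via named-entity recognition. This is not the authors' Turkish system;
# spaCy's English model is used here as a stand-in recognizer
# (requires the en_core_web_sm model to be installed).
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def build_entity_index(segments):
    """segments: list of (segment_id, transcript_text).
    Returns a mapping entity text -> set of segment ids, usable as a
    simple semantic index for retrieval."""
    index = defaultdict(set)
    for seg_id, text in segments:
        for ent in nlp(text).ents:
            if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
                index[ent.text.lower()].add(seg_id)
    return index

# index = build_entity_index([(0, "The prime minister visited Ankara today.")])
# index["ankara"] -> {0}
```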
Chapter
In this paper we describe a method to automatically discover important concepts and their relationships in e-Lecture material. The discovered knowledge is used to display semantic-aware categorizations and query suggestions for facilitating navigation inside an unstructured multimedia repository of e-Lectures. We report on an implemented approach for dealing with learning materials referring to the same event in different languages. The information acquired from the speech is combined with documents such as presentation slides, which are temporally synchronized with the video, to create new knowledge through a mapping to a taxonomy representation such as Wikipedia.
Article
The focus of much of the research on providing user-centered control of multimedia has been on the definition of models and (meta-data) descriptions that assist in locating or recommending media objects. While this can provide a more efficient means of selecting content, it provides little extra control for users once that content is rendered. In this article, we consider various means for supporting user-centered control of media within a collection of objects that are structured into a multimedia presentation. We begin with an examination of the constraints of user-centered control based on the characteristics of multimedia applications and the media processing pipeline. We then define four classes of control that can enable a more user-centric manipulation within media content. Each of these control classes is illustrated in terms of a common news viewing system. We continue with reflections on the impact of these control classes on the development of multimedia languages, rendering infrastructures and authoring systems. We conclude with a discussion of our plans for infrastructure support for user-centered multimedia control.
Article
Full-text available
This paper deals with multimedia information access. We propose two new approaches for hybrid text-image information processing that can be straightforwardly generalized to the more general multimodal scenario. Both approaches fall into the trans-media pseudo-relevance feedback category. Our first method uses a mixture model of the aggregate components, considering them as a single relevance concept. In our second approach, we define trans-media similarities as an aggregation of monomodal similarities between the elements of the aggregate and the new multimodal object. We also introduce the monomodal similarity measures for text and images that serve as basic components for both proposed trans-media similarities. We show how a large variety of problems can be framed so as to be addressed with the proposed techniques: image annotation or captioning, text illustration, and multimedia retrieval and clustering. Finally, we present how these methods can be integrated into two applications: a travel blog assistant system and a tool for browsing Wikipedia that takes into account the multimedia nature of its content.
Conference Paper
Knowledge develops and spreads rapidly in quality, quantity, depth, and extent. The key source of sustainable competitive advantage lies in the way knowledge is created, shared, and utilized. This paper discusses the necessity of using knowledge elements for managing and representing knowledge, presents the framework of the main system, and describes the process of the knowledge-element mining subsystem and the main relationships among knowledge elements.
Conference Paper
Video texts - if available - constitute a valuable source for automatic semantic annotation of large video archives. In this paper, we present our attempts towards the improvement of a text-based semantic annotation and retrieval system for Turkish news videos through automatic Web alignment and event extraction. The results of our initial experiments turn out to be promising and these two features are incorporated into the existing system. Although the ideas of automatic Web alignment and text-based event extraction are not the novel contributions of the current paper, to the best of our knowledge, their first implementation and employment in a system for Turkish news videos is a significant contribution to related work on videos in lesser studied languages such as Turkish. Also overviewed in the current paper is the prospective version of the system encompassing components for several other tasks including topic segmentation, keyphrase extraction, news categorization and summarization to enhance the overall system.
Article
Full-text available
This paper presents enhanced work building on our previous paper (Chaisorn et al. 2002). The system is enhanced to perform news story segmentation on the large video corpus used in the TRECVID 2003 evaluation. We use a combination of features, including visual features such as color, object-based features such as faces and video text, temporal features such as audio and motion, and semantic features such as cue phrases. We employ a Decision Tree and specific detectors to perform shot classification/tagging. We then use the shot category information, along with two temporal features, to identify story boundaries using HMMs (Hidden Markov Models). Finally, a heuristic rule-based technique is applied to classify each detected story as "news" or "misc".
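The HMM-based boundary detection step can be illustrated with a generic Viterbi decoder over per-shot category labels. The states, categories, and all probabilities below are invented for illustration and are not the parameters of the system described above.

```python
# Minimal Viterbi sketch for HMM-based story boundary detection over a
# sequence of shot categories. States, categories, and probabilities are
# invented for illustration only.
import math

states = ["story_body", "boundary"]
start_p = {"story_body": 0.9, "boundary": 0.1}
trans_p = {"story_body": {"story_body": 0.9, "boundary": 0.1},
           "boundary":   {"story_body": 0.8, "boundary": 0.2}}
# Emission probabilities of the observed shot category given the state.
emit_p = {"story_body": {"anchor": 0.2, "report": 0.6, "graphics": 0.2},
          "boundary":   {"anchor": 0.7, "report": 0.1, "graphics": 0.2}}

def viterbi(observations):
    """Return the most likely state sequence for the observed shot tags."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
          for s in states}]
    for obs in observations[1:]:
        layer = {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[-1][p][0] + math.log(trans_p[p][s]))
            score = (V[-1][best_prev][0] + math.log(trans_p[best_prev][s])
                     + math.log(emit_p[s][obs]))
            layer[s] = (score, V[-1][best_prev][1] + [s])
        V.append(layer)
    return max(V[-1].values())[1]

# viterbi(["anchor", "report", "report", "anchor", "report"]) returns one state
# per shot; shots decoded as "boundary" suggest candidate story starts.
```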
Article
Full-text available
In this paper, we use Barry and Hartigan's Product Partition Models to formulate text segmentation as an optimization problem, which we solve by a fast dynamic programming algorithm. We test the algorithm on Choi's segmentation benchmark and achieve the best segmentation results so far reported in the literature.
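The optimization view described above lends itself to a standard dynamic program over candidate boundaries. The sketch below is a minimal illustration with a placeholder per-segment cost; it does not reproduce the Product Partition Model score used in the paper.

```python
# Minimal dynamic-programming sketch for text segmentation as an optimization
# problem: choose boundaries minimizing the total cost of the segments.
# segment_cost() is a placeholder, not the paper's actual model score.

def segment_cost(sentences, i, j):
    """Cost of making sentences[i:j] one segment (placeholder:
    penalize vocabulary spread within the segment)."""
    vocab, total = set(), 0
    for s in sentences[i:j]:
        words = s.lower().split()
        vocab.update(words)
        total += len(words)
    return len(vocab) / max(total, 1)

def segment(sentences, max_len=8):
    """Return sorted boundary positions (indices where a new segment starts)."""
    n = len(sentences)
    best = [0.0] + [float("inf")] * n      # best[j] = min cost of sentences[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            c = best[i] + segment_cost(sentences, i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    # Recover boundary positions by walking back pointers.
    bounds, j = [], n
    while j > 0:
        bounds.append(back[j])
        j = back[j]
    return sorted(b for b in bounds if b > 0)
```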
Conference Paper
Full-text available
The KIM platform provides a novel Knowledge and Information Management infrastructure and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature infrastructure for scalable and customizable information extraction (IE), as well as annotation and document management, based on GATE. In order to provide a basic level of performance and allow easy bootstrapping of applications, KIM is equipped with an upper-level ontology and a knowledge base providing extensive coverage of entities of general importance. The ontologies and knowledge bases involved are handled using cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware, and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, content retrieval based on semantic restrictions, and querying and modifying the underlying ontologies and knowledge bases. This paper presents the KIM platform, with emphasis on its architecture, interfaces, tools, and other technical issues.
Conference Paper
Full-text available
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
Article
Full-text available
Keyphrases are useful for a variety of purposes, including summarizing, indexing, labeling, categorizing, clustering, highlighting, browsing, and searching. The task of automatic keyphrase extraction is to select keyphrases from within the text of a given document. Automatic keyphrase extraction makes it feasible to generate keyphrases for the huge number of documents that do not have manually assigned keyphrases. A limitation of previous keyphrase extraction algorithms is that the selected keyphrases are occasionally incoherent. That is, the majority of the output keyphrases may fit together well, but there may be a minority that appear to be outliers, with no clear semantic relation to the majority or to each other. This paper presents enhancements to the Kea keyphrase extraction algorithm that are designed to increase the coherence of the extracted keyphrases. The approach is to use the degree of statistical association among candidate keyphrases as evidence that they may be semantically related. The statistical association is measured using web mining. Experiments demonstrate that the enhancements improve the quality of the extracted keyphrases. Furthermore, the enhancements are not domain-specific: the algorithm generalizes well when it is trained on one domain (computer science documents) and tested on another (physics documents).
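One way to realize the web-mined statistical association described in these two entries is pointwise mutual information estimated from document counts. The sketch below is a hedged illustration rather than the Kea enhancement itself; doc_count() is a hypothetical search-engine or index API, and N is an assumed collection size.

```python
# Hedged sketch of measuring statistical association between candidate
# keyphrases with web counts, as one way to prefer coherent keyphrase sets.
# doc_count() is a hypothetical web search / index API; N is an assumed
# collection size.
import math

def doc_count(query: str) -> int:
    raise NotImplementedError("plug in a search engine or index here")

def pmi(phrase_a: str, phrase_b: str, N: float = 1e10) -> float:
    """Pointwise mutual information between two phrases estimated from
    document co-occurrence counts."""
    na, nb = doc_count(f'"{phrase_a}"'), doc_count(f'"{phrase_b}"')
    nab = doc_count(f'"{phrase_a}" "{phrase_b}"')
    if min(na, nb, nab) == 0:
        return float("-inf")
    return math.log((nab / N) / ((na / N) * (nb / N)))

def coherence(candidates):
    """Average pairwise association of a candidate keyphrase set."""
    pairs = [(a, b) for i, a in enumerate(candidates) for b in candidates[i + 1:]]
    return sum(pmi(a, b) for a, b in pairs) / max(len(pairs), 1)
```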
Article
Full-text available
This paper describes a method for linear text segmentation that is more accurate or at least as accurate as state-of-the-art methods (Utiyama and Isahara, 2001; Choi, 2000a). Inter-sentence similarity is estimated by latent semantic analysis (LSA). Boundary locations are discovered by divisive clustering. Test results show LSA is a more accurate similarity measure than the cosine metric (van Rijsbergen, 1979).
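A minimal sketch of the LSA similarity idea, using scikit-learn's TruncatedSVD over TF-IDF vectors; a simple low-similarity threshold stands in for the divisive clustering step used in the paper.

```python
# Minimal sketch of LSA-based inter-sentence similarity for text segmentation,
# using scikit-learn. A low-adjacent-similarity heuristic stands in for the
# paper's divisive clustering step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def adjacent_lsa_similarities(sentences, n_components=50):
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    n_components = min(n_components, tfidf.shape[1] - 1, len(sentences) - 1)
    lsa = TruncatedSVD(n_components=max(n_components, 1)).fit_transform(tfidf)
    sims = cosine_similarity(lsa[:-1], lsa[1:])
    # sims[i, i] is the similarity between sentence i and sentence i+1.
    return [sims[i, i] for i in range(len(sentences) - 1)]

def boundaries(sentences, threshold=0.1):
    """Indices i such that a segment boundary is placed between
    sentences[i-1] and sentences[i]."""
    return [i + 1 for i, s in enumerate(adjacent_lsa_similarities(sentences))
            if s < threshold]
```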
Article
Full-text available
Keyphrases are an important means of document summarization, clustering, and topic search. Only a small minority of documents have author-assigned keyphrases, and manually assigning keyphrases to existing documents is very laborious. Therefore it is highly desirable to automate the keyphrase extraction process. This paper shows that a simple procedure for keyphrase extraction based on the naive Bayes learning scheme performs comparably to the state of the art. It goes on to explain how this procedure's performance can be boosted by automatically tailoring the extraction process to the particular document collection at hand. Results on a large collection of technical reports in computer science show that the quality of the extracted keyphrases improves significantly when domain-specific information is exploited. 1 Introduction Keyphrases give a high-level description of a document's contents that is intended to make it easy for prospective readers to decide whether or not...
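Kea-style extraction is commonly described as feeding two simple per-candidate features, TF x IDF and the relative position of first occurrence, to a naive Bayes classifier. The sketch below computes those two features; the classifier and candidate-phrase generation are omitted, and the exact feature definitions here are an approximation.

```python
# Hedged sketch of two classic Kea-style features for a candidate phrase:
# TF x IDF and the relative position of first occurrence. A naive Bayes
# model trained on phrases labelled keyphrase / non-keyphrase would consume
# these features; the model itself is omitted.
import math

def kea_features(phrase, document, doc_freq, n_docs):
    """doc_freq: number of training documents containing the phrase;
    n_docs: total number of training documents."""
    doc, p = document.lower(), phrase.lower()
    tf = doc.count(p) / max(len(doc.split()), 1)
    idf = math.log((n_docs + 1) / (doc_freq + 1))
    pos = doc.find(p)
    first_occurrence = pos / max(len(doc), 1) if pos >= 0 else 1.0  # 0 = early
    return {"tf_idf": tf * idf, "first_occurrence": first_occurrence}
```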
Article
Full-text available
We present a new method for discovering the segmental discourse structure of a document while categorizing each segment's function and importance. Segments are determined by a zero-sum weighting scheme applied to occurrences of noun phrases and pronominal forms retrieved from the document. Segment roles are then calculated from the distribution of the terms in the segment. Finally, we present the results of an evaluation in terms of precision and recall, which surpass those of earlier approaches.
Article
Full-text available
This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large-vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a real-time British English Broadcast News system and its integration into a spoken document retrieval system. Detailed evaluation is performed using a similar North American Broadcast News system, to take advantage of the TREC SDR evaluation methodology. We report results on this evaluation, with particular reference to the effect of query expansion and of automatic segmentation algorithms. 1. INTRODUCTION THISL is an ESPRIT Long Term Research project in the area of speech retrieval. It is concerned with the construction of a system which performs good recognition of broadcast speech from television and radio news programmes, from which it can produce multimedia indexing data. The principal objective of the project is to construct a spo...
Article
The approach towards Semantic Web information extraction (IE) presented here is implemented in KIM, a platform for semantic indexing, annotation, and retrieval. It combines IE based on the mature text engineering platform GATE with Semantic Web-compliant knowledge representation and management. The cornerstone is the automatic generation of named-entity (NE) annotations with class and instance references to a semantic repository. A simplistic upper-level ontology, providing detailed coverage of the most popular entity types (Person, Organization, Location, etc.; more than 250 classes), is designed and used. A knowledge base (KB) with de facto exhaustive coverage of real-world entities of general importance is maintained, used, and constantly enriched. Extensions of the ontology and KB take care of handling all the lexical resources used for IE; most notably, instead of gazetteer lists, aliases of specific entities are kept together with them in the KB. A Semantic Gazetteer uses the KB to generate lookup annotations. Ontology-aware pattern-matching grammars allow precise class information to be handled via rules at the optimal level of generality. The grammars are used to recognize NEs, with class and instance information referring to the KIM ontology and KB. Recognition of identity relations between the entities is used to unify their references to the KB. Based on the recognized NEs, template relation construction is performed via grammar rules. As a result, the KB is enriched with the recognized relations between entities. In the final phase of the IE process, previously unknown aliases and entities are added to the KB with their specific types.
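For illustration only, the sketch below shows how retrieval by semantic restriction over entity annotations can look when the annotations are stored as RDF and queried with SPARQL via rdflib. The ex: namespace, classes, and properties are invented and are not KIM's actual ontology or API.

```python
# Illustrative sketch (not KIM's actual ontology or API) of retrieving
# documents by a semantic restriction over entity annotations stored as RDF.
# The ex: namespace and all property names are invented.
from rdflib import Graph

ttl = """
@prefix ex: <http://example.org/kb#> .
ex:doc1 ex:mentions ex:acme .
ex:acme a ex:Company ; ex:locatedIn ex:london .
ex:doc2 ex:mentions ex:bob .
ex:bob a ex:Person .
"""

g = Graph()
g.parse(data=ttl, format="turtle")

# "Find documents mentioning a Company located in London."
query = """
PREFIX ex: <http://example.org/kb#>
SELECT ?doc WHERE {
  ?doc ex:mentions ?entity .
  ?entity a ex:Company ; ex:locatedIn ex:london .
}
"""
for row in g.query(query):
    print(row.doc)   # -> http://example.org/kb#doc1
```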
Article
The KIM platform provides a novel Knowledge and Information Management framework and services for automatic semantic annotation, indexing, and retrieval of documents. It provides a mature and semantically enabled infrastructure for scalable and customizable information extraction (IE), as well as annotation and document management, based on GATE. Our understanding is that a system for semantic annotation should be based upon a simple model of real-world entity concepts, complemented with quasi-exhaustive instance knowledge. To ensure efficiency, easy sharing, and reusability of the metadata, we introduce an upper-level ontology. Based on the ontology, a large-scale instance base of entity descriptions is maintained. The knowledge resources involved are handled by use of state-of-the-art Semantic Web technology and standards, including RDF(S) repositories, ontology middleware and reasoning. From a technical point of view, the platform allows KIM-based applications to use it for automatic semantic annotation, for content retrieval based on semantic queries, and for semantic repository access. As a framework, KIM also allows various IE modules, semantic repositories and information retrieval engines to be plugged into it. This paper presents the KIM platform, with an emphasis on its architecture, interfaces, front-ends, and other technical issues.
Chapter
This chapter details the value and methods for content augmentation and personalization among different media such as TV and Web. We illustrate how metadata extraction can aid in combining different media to produce a novel content consumption and interaction experience. We present two pilot content augmentation applications. The first, called MyInfo, combines automatically segmented and summarized TV news with information extracted from Web sources. Our news summarization and metadata extraction process employs text summarization, anchor detection and visual key element selection. Enhanced metadata allows matching against the user profile for personalization. Our second pilot application, called InfoSip, performs person identification and scene annotation based on actor presence. Person identification relies on visual, audio, text analysis and talking face detection. The InfoSip application links person identity information with filmographies and biographies extracted from the Web, improving the TV viewing experience by allowing users to easily query their TVs for information about actors in the current scene.
Article
This paper introduces a new statistical approach to automatically partitioning text into coherent segments. The approach is based on a technique that incrementally builds an exponential model to extract features that are correlated with the presence of boundaries in labeled training text. The models use two classes of features: topicality features that use adaptive language models in a novel way to detect broad changes of topic, and cue-word features that detect occurrences of specific words, which may be domain-specific, that tend to be used near segment boundaries. Assessment of our approach on quantitative and qualitative grounds demonstrates its effectiveness in two very different domains, Wall Street Journal news articles and television broadcast news story transcripts. Quantitative results on these domains are presented using a new probabilistically motivated error metric, which combines precision and recall in a natural and flexible way. This metric is used to make a quantitative assessment of the relative contributions of the different feature types, as well as a comparison with decision trees and previously proposed text segmentation algorithms.
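The probabilistically motivated error metric introduced in this line of work is usually referred to as Pk in later literature: the probability that two positions a fixed distance apart are placed in the same segment by one segmentation but not by the other. A minimal sketch, assuming segmentations are given as lists of segment lengths:

```python
# Minimal sketch of the windowed segmentation error metric commonly
# associated with this work (often called Pk). Segmentations are given
# as lists of segment lengths; lower is better.
def pk(reference, hypothesis, k=None):
    def labels(seg_lengths):
        # Label each unit with the index of the segment it belongs to.
        out = []
        for seg_id, length in enumerate(seg_lengths):
            out.extend([seg_id] * length)
        return out

    ref, hyp = labels(reference), labels(hypothesis)
    assert len(ref) == len(hyp), "segmentations must cover the same text"
    if k is None:
        k = max(1, round(len(ref) / (2 * len(reference))))  # half mean segment size
    errors, total = 0, len(ref) - k
    for i in range(total):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        errors += same_ref != same_hyp
    return errors / max(total, 1)

# pk([5, 5, 5], [5, 10]) -> a value > 0, penalizing the missed boundary.
```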
Article
The realization of the Semantic Web depends on the availability of a critical mass of metadata for web content, associated with the respective formal knowledge about the world. We claim that the Semantic Web, at its current stage of development, is in critical need of metadata generation and usage schemata that are specific, well-defined, and easy to understand. This paper introduces our vision for a holistic architecture for semantic annotation, indexing, and retrieval of documents with regard to extensive semantic repositories. A system (called KIM) implementing this concept is presented in brief and is used for the purposes of evaluation and demonstration. A particular schema for semantic annotation with respect to real-world entities is proposed. The underlying philosophy is that practical semantic annotation is impossible without some particular knowledge modelling commitments. Our understanding is that a system for such semantic annotation should be based upon a simple model of real-world entity classes, complemented with extensive instance knowledge. To ensure the efficiency, ease of sharing, and reusability of the metadata
Article
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on an archive of British English broadcast news.
Conference Paper
This paper presents a new method for topic-based document segmentation, i.e., the identification of boundaries between parts of a document that bear on different topics. The method combines the use of the Probabilistic Latent Semantic Analysis (PLSA) model with the method of selecting segmentation points based on the similarity values between pairs of adjacent blocks. The use of PLSA allows for a better representation of sparse information in a text block, such as a sentence or a sequence of sentences. Furthermore, segmentation performance is improved by combining different instantiations of the same model, either using different random initializations or different numbers of latent classes. Results on commonly available data sets are significantly better than those of other state-of-the-art systems.
Conference Paper
Digital archives have emerged as the pre-eminent method for capturing the human experience. Before such archives can be used efficiently, their contents must be described. The scale of such archives, along with the associated content markup cost, makes it impractical to provide access via purely manual means, but automatic technologies for search in spoken materials still have relatively limited capabilities. The NSF-funded MALACH project will use the world's largest digital archive of video oral histories, collected by the Survivors of the Shoah Visual History Foundation (VHF), to make a quantum leap in the ability to access such archives by advancing the state of the art in Automated Speech Recognition (ASR), Natural Language Processing (NLP) and related technologies (1, 2). This corpus consists of over 115,000 hours of unconstrained, natural speech from 52,000 speakers in 32 different languages, filled with disfluencies, heavy accents, age-related coarticulations, and un-cued speaker and language switching. This paper discusses some of the ASR and NLP tools and technologies that we have been building for the English speech in the MALACH corpus. We also discuss this new test bed while emphasizing the unique characteristics of this corpus.
Article
Title generation is a complex task involving both natural language understanding and natural language synthesis. In this paper, we propose a new probabilistic model for title generation. Unlike previous statistical models for title generation, which treat title generation as a process that converts the 'document representation' of information directly into a 'title representation' of the same information, this model introduces a hidden state called the 'information source' and divides title generation into two steps: distilling the 'information source' from the observation of a document, and generating a title from the estimated 'information source'. In our experiments, the new probabilistic model outperforms the previous models for title generation in terms of both automatic evaluations and human judgments.
Article
This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state of the art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.
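The "rank in the local context" step can be illustrated by replacing each cell of an inter-sentence similarity matrix with the proportion of its neighbours that have a lower value. The sketch below is a simplified rendering of that idea; the divisive clustering step is omitted and the mask size is illustrative.

```python
# Hedged sketch of the "rank in the local context" idea: each cell of an
# inter-sentence similarity matrix is replaced by the proportion of its
# neighbours (within a small mask) that have a lower similarity. The
# subsequent divisive clustering step is omitted.
import numpy as np

def rank_transform(sim, mask=11):
    n = sim.shape[0]
    half = mask // 2
    ranked = np.zeros_like(sim, dtype=float)
    for i in range(n):
        for j in range(n):
            r0, r1 = max(0, i - half), min(n, i + half + 1)
            c0, c1 = max(0, j - half), min(n, j + half + 1)
            window = sim[r0:r1, c0:c1]
            ranked[i, j] = (window < sim[i, j]).sum() / max(window.size - 1, 1)
    return ranked
```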
Article
Continuing progress in the automatic transcription of broadcast speech via speech recognition has raised the possibility of applying information retrieval techniques to the resulting (errorful) text. In this paper we describe a general methodology based on Hidden Markov Models and classical language modeling techniques for automatically inferring story boundaries (segmentation) and for retrieving stories relating to a specific topic (tracking). We will present in detail the features and performance of the Segmentation and Tracking systems submitted by Dragon Systems for the 1998 Topic Detection and Tracking evaluation. 1. INTRODUCTION Over the last few years Dragon, like a number of other research sites, has been developing a speech recognition system capable of automatically transcribing broadcast speech. With the recent advances in this technology, a new source is becoming available for information mining, in the form of a continuous stream of errorful, unsegmented text. Applying s...
Article
This paper documents the Information Extraction Named-Entity Evaluation (IE-NE), one of the new spokes added to the DARPA-sponsored 1998 Hub-4 Broadcast News Evaluation. This paper discusses the information extraction task as posed for the 1998 Broadcast News Evaluation. This paper reviews the evaluation metrics, the scoring process, and the test corpus that was used for the evaluation. Finally, this paper reviews the results of the first running of a Hub-4 IE-NE Evaluation. The Baseline IE-NE evaluation, in which BBN's IdentiFinder was run on the primary system transcripts submitted for the Hub-4 Broadcast News evaluation, found that the transcripts generated by LIMSI's automatic speech recognition system produced the "highest" F-measure score (82.39). In the Quasi IE-NE evaluation, where sites ran their own NE taggers on a set of three baseline recognizer transcripts, the SRI-developed tagger achieved the highest F-measure score for baseline recognizers 1 & 3, while the BBN develop...
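The F-measure scores quoted above combine precision and recall over named-entity output. A minimal worked computation from counts of correct, spurious, and missed entities (the counts in the usage comment are invented):

```python
# Minimal computation of the precision / recall / F-measure used to score
# named-entity output, from counts of correct, spurious, and missed entities.
def ne_f_measure(correct, spurious, missed, beta=1.0):
    precision = correct / max(correct + spurious, 1)
    recall = correct / max(correct + missed, 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

# ne_f_measure(824, 90, 86) -> precision ~0.90, recall ~0.91, F ~0.90
# (illustrative counts, not results from the evaluation above)
```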
Robinson, T., Hochberg, M. and Renals, S. The use of recurrent networks in continuous speech recognition. In C. H. Lee, K. K. Paliwal and F. K. Soong (Eds.), Automatic speech and speaker recognition – advanced topics, 233-258, Kluwer Academic Publishers, Boston, 1996.