Fig 7. Entity types with 400 topics
Source publication
Conference Paper
Full-text available
Topic models are widely used to thematically describe a collection of text documents and have become an important technique for systems that measure document similarity for classification, clustering, segmentation, entity linking and more. While they have been applied to some non-text domains, their use for semi-structured graph data, such as RDF,...

Context in source publication

Context 1
... our test set was relatively small, we were able to see how precision changed based on data variations. As can be seen in Figure 6, we saw the highest precision using predicates and objects, and in Figure 7 we saw the highest precision using predicates and objects that included the WordNet synsets and definitions. Though it was clear that objects alone did not perform as well as including the predicate, future work will further explore the relationship between supplemental data and the number of topics chosen for the model. ...
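To make the document-construction step described above concrete, the following is a minimal sketch (not the authors' code) of how an entity's predicates and objects might be turned into a pseudo-document and supplemented with WordNet synsets and definitions; the tokenization choices and the two-synset limit are illustrative assumptions.

```python
# A minimal sketch, assuming NLTK's WordNet interface, of building a pseudo-document
# for one entity from its predicates and objects, supplemented with WordNet synsets
# and definitions as described in the context above. Tokenization and the two-synset
# limit are illustrative assumptions, not the authors' settings.
from nltk.corpus import wordnet as wn  # requires a prior nltk.download('wordnet')

def entity_document(triples, add_wordnet=True):
    """triples: iterable of (subject, predicate, object) strings for one entity."""
    tokens = []
    for _, predicate, obj in triples:
        # keep the local name of the predicate and the object's surface form
        tokens.append(predicate.split('/')[-1].split('#')[-1].lower())
        tokens.extend(str(obj).lower().split())
    if add_wordnet:
        for tok in list(tokens):
            for syn in wn.synsets(tok)[:2]:  # a couple of synsets per token
                tokens.extend(lemma.name().lower() for lemma in syn.lemmas())
                tokens.extend(syn.definition().lower().split())
    return tokens

# toy example: one triple about a hypothetical entity
print(entity_document([("ex:Rome", "http://dbpedia.org/ontology/country", "Italy")]))
```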

Similar publications

Conference Paper
Full-text available
This work presents a novel ontology-based approach for the complementation of technical specifications of cyber-physical system components using ontological classification and reasoning. We build on the AutomationML standard and outline how data represented with it can be transformed into an RDF instance graph. We exemplarily show how complementary...

Citations

... In NELLIE, we adopted KG matching techniques based on document similarity, where we generated one document per KG. Sleeman et al. [27] developed a method for using topic modeling with RDF data. This method produces a separate document for each entity described in a KG, whereas our method generates documents that describe an entire KG. ...
Article
Full-text available
Knowledge graphs (KGs) that follow the Linked Data principles are created daily. However, there are no holistic models for the Linked Open Data (LOD). Building these models (i.e., engineering a pipeline system) is still a big challenge in making the LOD vision come true. In this paper, we address this challenge by presenting NELLIE, a pipeline architecture to build a chain of modules, in which each of our modules addresses one data augmentation challenge. The ultimate goal of the proposed architecture is to build a single fused knowledge graph out of the LOD. NELLIE starts by crawling the available knowledge graphs in the LOD cloud. It then finds a set of matching KG pairs. NELLIE uses a two-phase linking approach for each pair (first an ontology matching phase, then an instance matching phase). Based on the ontology and instance matching, NELLIE fuses each pair of knowledge graphs into a single knowledge graph. The resulting fused KG is then an ideal data source for knowledge-driven applications such as search engines, question answering, digital assistants and drug discovery. Our evaluation shows an improved Hit@1 score of the link prediction task on the resulting fused knowledge graph by NELLIE in up to 94.44% of the cases. Our evaluation also shows a runtime improvement by several orders of magnitude when comparing our two-phase linking approach with the estimated runtime of linking using a naïve approach.
... In this direction, e.g., Nickel et al. [26] have focused on incorporating ontological knowledge into a sparse factorization model for learning relations of the YAGO Knowledge Base. Other solutions (Paulheim and Bizer [29], Sleeman and Finin [30], Sleeman et al. [39], [40]) used statistical and machine learning approaches, such as probabilistic and topic modeling, focusing on enhancing the DBpedia knowledge base. ...
Article
Full-text available
Knowledge bases allow data organization and exploration, making it easier for machines to semantically understand and use the data. Traditional strategies for knowledge base construction and augmentation have mostly relied on manual effort or automatic extraction of content from structured and semi-structured sources. In this work, we present DeepEx, a system that autonomously extracts missing attributes of entities in knowledge bases from unstructured text. We use Wikipedia as a data source. Given entities on Wikipedia represented by their articles (text and infobox), DeepEx uses a classifier to detect sentences in the articles mentioning the possible missing attributes of the entities and then employs a deep-learning extraction model on those sentences to identify the attributes. The sentence classifier and attribute extractor are built with labels automatically produced by a weak supervision approach using infobox structured information as the supervision source. We have compared our strategy with previous approaches to this problem on 29 different attributes from 4 domains. The results showed that our extraction pipeline achieved statistically superior performance in comparison with some baselines and variations of our approach.
... However, few techniques have been proposed for applying topic modeling to data other than unstructured text. [26] proposed a framework for applying topic modeling to RDF graph data based on LDA; they highlighted some of the major challenges in using topic modeling over RDF data. These challenges relate to the sparseness and the unnatural language of RDF graphs, and they gave some methods to tackle them. ...
... Topic models were first introduced for text documents and are easily adaptable to the case of RDF graphs. In [26] an approach to using topic modeling with RDF data was proposed based on Latent Dirichlet Allocation (LDA), a commonly used model for identifying the topics of documents. LDA aims to extract thematic information from collections of documents and is based on a bag-of-words vocabulary extracted from these documents. ...
... In our use case, the documents are the RDF graphs and the words are those extracted from the graphs' triples. [26] introduced several limitations and challenges of using topic modeling for RDF graphs. First, RDF data is sparse, which means that even with large datasets, preprocessing may result in a restricted set of words that can be used as a bag of words. Second, there is a lack of context, since the words used can have several meanings. ...
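As a concrete illustration of the setup just described (documents = RDF graphs, words = tokens extracted from their triples), here is a hedged sketch of fitting LDA with gensim; the toy documents and parameter values are assumptions, not taken from [26].

```python
# Hedged sketch: each RDF graph is one document whose "words" come from its triples;
# LDA is then fit with gensim. Toy documents and parameters are assumptions.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["country", "italy", "capital", "rome", "population"],   # words from graph 1
    ["genre", "rock", "band", "album", "guitar", "artist"],  # words from graph 2
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for bow in corpus:
    # per-document topic distribution, including zero-probability topics
    print(lda.get_document_topics(bow, minimum_probability=0.0))
```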
... Given the importance of the missing type assertion task, approaches based on reasoning, probabilistic methods [49], hierarchical classification [40] [41] [60], association rule mining [32] [46], topic modelling [52], support vector machines (SVM) [51], k-nearest neighbors classifiers (KNN) [45], and tensor factorization [44] have been proposed and applied to different KGs, as summarized in Table 1. All approaches have merits and demerits in performance for the missing type assertion task in KGs. ...
... In [52], the use of topic modeling for type prediction is proposed. Entities in a KG are represented as documents, on which Latent Dirichlet Allocation (LDA) [15] is applied for finding topics. ...
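One plausible way to turn such per-entity topic distributions into type predictions is sketched below, using a logistic-regression classifier over dense topic vectors; this is an assumption for illustration and may differ from the exact procedure in [52].

```python
# Hedged sketch: per-entity LDA topic distributions as features for type prediction.
# Assumes an `lda` model and `dictionary` like those in the sketch above;
# `labeled_docs`, `labels`, and `new_entity_tokens` are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def topic_vector(lda, dictionary, tokens, num_topics):
    """Return the dense topic distribution for one entity's token list."""
    bow = dictionary.doc2bow(tokens)
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

# X = np.vstack([topic_vector(lda, dictionary, toks, 2) for toks in labeled_docs])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
# clf.predict(topic_vector(lda, dictionary, new_entity_tokens, 2).reshape(1, -1))
```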
Conference Paper
Knowledge Graphs (KGs) have been proven to be incredibly useful for enriching semantic Web search results and allowing queries with a well-defined result set. In recent years much attention has been given to the task of inferring missing facts based on existing facts in a KG. Approaches have also been proposed for inferring types of entities; however, these are successful for common types such as 'Person', 'Movie', or 'Actor'. There is still a large gap in the inference of fine-grained types, which are highly important for exploring specific lists and collections within web search. Generally, there are also relatively few observed instances of fine-grained types present in KGs to train on, and this poses challenges for the development of effective approaches. In order to address the issue, this paper proposes a new approach to the fine-grained type inference problem. This new approach is explicitly modeled to leverage domain knowledge and utilize additional data outside the KG, which improves performance in fine-grained type inference. Further improvements in efficiency are achieved by extending the model to probabilistic inference based on entity similarity and typed class classification. We conduct extensive experiments on type triple classification and entity prediction tasks on the Freebase FB15K benchmark dataset. The experiment results show that the proposed model outperforms the state-of-the-art approaches for type inference in KGs, and achieves high performance on many-to-one relations when predicting the tail for the KG completion task.
... The Semantic Web community has long investigated methods to address the data linking problem, by identifying linked dataset quality assessment methodologies [30] and by proposing manual, semi-automatic or automatic tools to implement refinement operations [11,12]. The large majority of refinement approaches, especially on knowledge graphs, where scalable solutions are needed, are based on different statistical and machine learning techniques [17,8,23,24]. ...
Chapter
Full-text available
With the rise of linked data and knowledge graphs, the need becomes compelling to find suitable solutions to increase the coverage and correctness of datasets, to add missing knowledge and to identify and remove errors. Several approaches - mostly relying on machine learning and NLP techniques - have been proposed to address this refinement goal; they usually need a partial gold standard, i.e. some "ground truth" to train automatic models. Gold standards are manually constructed, either by involving domain experts or by adopting crowdsourcing and human computation solutions. In this paper, we present an open source software framework to build Games with a Purpose for linked data refinement, i.e. web applications to crowdsource partial ground truth, by motivating user participation through fun incentive. We detail the impact of this new resource by explaining the specific data linking "purposes" supported by the framework (creation, ranking and validation of links) and by defining the respective crowdsourcing tasks to achieve those goals. To show this resource's versatility, we describe a set of diverse applications that we built on top of it; to demonstrate its reusability and extensibility potential, we provide references to detailed documentation, including an entire tutorial which in a few hours guides new adopters to customize and adapt the framework to a new use case.
... RDF data is usually categorized as short text [13]. In addition, its sparsity, unnatural language, and lack of context [14] make RDF data a unique data repository that requires flexible techniques to extract useful information for text mining tasks, including tagging, summarization, and others. To overcome these short-text-related problems in different tasks, two main approaches have been utilized, based on supplementing the RDF data and on adjusting the application to be able to handle short text properly. ...
... Topic models were originally introduced and used to discover latent topics in text corpora; however, they have been applied to other types of data, such as images [28], and recently Sleeman et al. [14] applied topic modeling to RDF graphs and showed its application for tasks such as predicting entity types, entity disambiguation, and community detection. Defining documents and word-like elements is the key step in applying topic models for various applications. ...
... Similarly, we also performed preprocessing on the RDF data and filtered out the schema- and dataset-dependent predicates, such as sameAs, wikiPageExternalLink, subject, and wikiPageWikiLink. Since RDF data are recognized as short text [13], we need to address several challenges, including sparseness, unnatural language, and the lack of context [14], to be able to run topic models on RDF data properly. Sparseness in RDF data is very common, resulting from a limited number of words or phrases that play the role of subject, predicate, or object in triples. ...
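A minimal sketch of this kind of predicate filtering is given below; the blacklist and the full IRIs are assumptions inferred from the predicate names mentioned above, not the authors' actual list.

```python
# Hedged sketch: drop schema- and dataset-dependent predicates before building
# documents. The blacklisted IRIs are assumptions, not the authors' exact list.
SKIP_PREDICATES = {
    "http://www.w3.org/2002/07/owl#sameAs",
    "http://dbpedia.org/ontology/wikiPageExternalLink",
    "http://purl.org/dc/terms/subject",
    "http://dbpedia.org/ontology/wikiPageWikiLink",
}

def filtered_triples(triples):
    """Keep only triples whose predicate is not schema- or dataset-dependent."""
    return [(s, p, o) for s, p, o in triples if str(p) not in SKIP_PREDICATES]
```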
Conference Paper
Word embedding is becoming more popular in the Semantic Web community as an effective approach for capturing semantics in various contexts. In this paper, we combine word embedding and topic modeling to model RDF data for the entity summarization task. In our model, ES-LDA_ext, which is the extended version of our previous model, we utilize the word embedding to supplement the RDF data before applying entity summarization. In addition, in the model presented here, we use RDF literals as a very good source of information to create more reliable and representative summaries for entities. To do that, we use the Named Entity Recognition approach to extract entities within literals before feeding them into the word embedding model to enrich the RDF data. Experimental results demonstrate the effectiveness of the proposed model.
... Topic models were originally introduced for text documents; however, they have been applied to other types of data, such as images, and recently Sleeman et al. (2015) used topic modeling for RDF graphs. The first step in applying topic models is to define documents and word-like elements as the basic building blocks of documents. ...
... We also performed preprocessing on the RDF data and filtered out the schema- and dataset-dependent predicates, such as sameAs, wikiPageExternalLink, subject, and wikiPageWikiLink, in addition to literals. Since we work with RDF graphs, which differ from typical text documents in that RDF data are represented as triples, we need to address several challenges mentioned in (Sleeman et al., 2015) to be able to run topic models on RDF data. These challenges include sparseness, use of unnatural language, and the lack of context. ...
... As topic modeling is based on statistics of the co-occurrence of terms (Sleeman et al., 2015), when dealing with short texts with a very limited number of repetitions, as is the case with RDF data, we need to find a way to supplement the data to improve the performance of the topic modeling approach. We augment the documents using two different methods. ...
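As one hedged illustration of such augmentation (not necessarily either of the paper's two methods), each short document could be expanded with the nearest neighbours of its tokens in a pre-trained word-embedding model:

```python
# Hedged sketch: expand each short document with embedding nearest neighbours.
# The embeddings path is a placeholder; any word2vec-format vectors would do.
from gensim.models import KeyedVectors

# vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

def augment(tokens, vectors, topn=3):
    expanded = list(tokens)
    for tok in tokens:
        if tok in vectors:  # skip out-of-vocabulary tokens
            expanded.extend(w for w, _ in vectors.most_similar(tok, topn=topn))
    return expanded
```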
Conference Paper
Full-text available
With the advent of the Internet, the amount of Semantic Web documents that describe real-world entities and their inter-links as a set of statements have grown considerably. These descriptions are usually lengthy, which makes the utilization of the underlying entities a difficult task. Entity summarization, which aims to create summaries for real world entities, has gained increasing attention in recent years. In this paper, we propose a probabilistic topic model, ES-LDA , that combines prior knowledge with statistical learning techniques within a single framework to create more reliable and representative summaries for entities. We demonstrate the effectiveness of our approach by conducting extensive experiments and show that our model outperforms the state-of-the-art techniques and enhances the quality of the entity summaries.
... On the other hand, semantic similarity on knowledge graphs using ontology matching, ontology alignment, schema matching, instance matching, similarity search, etc. remains a challenge [15], [16], [17]. Sleeman et al. [18] used vectors based on topic models to compare the similarity of nodes in RDF graphs. In this paper we propose the VKG structure, in which we link the knowledge graph nodes to their embeddings in a vector space (see Section III). ...
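In that spirit, node similarity can be computed as the cosine similarity of the nodes' topic vectors; the sketch below assumes dense topic distributions like those produced by the earlier LDA sketches and is only an illustration of the general idea.

```python
# Hedged sketch: compare two RDF nodes by the cosine similarity of their topic vectors.
import numpy as np

def cosine(u, v):
    """Cosine similarity of two dense vectors, guarded against zero norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# node_a and node_b would be dense topic distributions for two entities:
# similarity = cosine(node_a, node_b)
```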
Article
Full-text available
Knowledge graphs and vector space models are both robust knowledge representation techniques with their individual strengths and weaknesses. Vector space models excel at determining similarity between concepts, but they are severely constrained when evaluating complex dependency relations and other logic based operations that are a forte of knowledge graphs. In this paper, we propose the V-KG structure that helps us unify knowledge graphs and vector representation of entities, and allows us to develop powerful inference methods and search capabilities that combine their complementary strengths. We analogize this to thinking `fast' in vector space along with thinking `deeply' and `slowly' by reasoning over the knowledge graph. We have also created a query processing engine that takes complex queries and decomposes them into subqueries optimized to run on the respective knowledge graph part or the vector part of V-KG. We show that the V-KG structure can process specific queries that are not efficiently handled by vector spaces or knowledge graphs alone. We also demonstrate and evaluate the V-KG structure and the query processing engine by developing a system called Cyber-All-Intel for knowledge extraction, representation and querying in an end-to-end pipeline grounded in the cybersecurity informatics domain.
... Recently, Sleeman et al. [13] proposed an approach to using topic modelling with RDF data. While their work has a similar basis, it differs in many ways, since it aims at other use cases. ...
... We could identify different parts of a dataset's metadata and show that the properties are most important for determining the dataset's topic. Additionally, we created a gold standard for this task that can be downloaded from the project's web page 13. ...
Conference Paper
The Web of data is growing continuously with respect to both the size and number of the datasets published. Porting a dataset to five-star Linked Data, however, requires the publisher of this dataset to link it with the already available linked datasets. Given the size and growth of the Linked Data Cloud, the current, mostly manual approach used for detecting relevant datasets for linking is obsolete. We study the use of topic modelling for dataset search experimentally and present Tapioca, a linked dataset search engine that provides data publishers with similar existing datasets automatically. Our search engine uses a novel approach for determining the topical similarity of datasets. This approach relies on probabilistic topic modelling to determine related datasets by relying solely on the metadata of datasets. We evaluate our approach on a manually created gold standard and with a user study. Our evaluation shows that our algorithm outperforms a set of comparable baseline algorithms, including standard search engines, significantly by 6% F1-score. Moreover, we show that it can be used on a large real world dataset with comparable performance.
Article
Knowledge graph (KG) refinement refers to the process of filling in missing information, removing redundancies, and resolving inconsistencies in knowledge graphs. With the growing popularity of KG in various domains, many techniques involving machine learning have been applied, but there is no survey dedicated to machine learning-based KG refinement yet. Based on a novel framework following the KG refinement process, this paper presents a survey of machine learning approaches to KG refinement according to the kind of operations in KG refinement, the training datasets, mode of learning, and process multiplicity. Furthermore, the survey aims to provide broad practical insights into the development of fully automated KG refinement.