An example SPARQL query, using the Wikidata SPARQL endpoint (query.wikidata.org). It retrieves all Wikidata (WD) items that are a subclass of protein-coding gene (Q840604), that have a chromosomal start position (P644) according to human genome build GRCh38, that reside on human chromosome (P659) 9 (Q20966585), and that have a chromosomal end position (P645) also on chromosome 9. Furthermore, the region of interest is restricted to chromosomal start positions between 21 and 30 megabase pairs. Colors: red indicates SPARQL commands, blue represents variable names, green represents URIs, and brown represents strings. Arrows point to the source code that each part of the description applies to.
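For readers without access to the figure, the query below is a minimal sketch of what the caption describes, written for the Wikidata Query Service (query.wikidata.org), where the wd:, wdt:, xsd:, wikibase: and bd: prefixes are predefined. The property and item identifiers are taken from the caption, except P279 ("subclass of"), which is an assumption; this is an illustrative reconstruction, not the figure's exact source code.

  SELECT ?gene ?geneLabel ?start ?end WHERE {
    ?gene wdt:P279 wd:Q840604 .      # subclass of protein-coding gene (P279 assumed)
    ?gene wdt:P659 wd:Q20966585 .    # resides on human chromosome 9 (per the caption)
    ?gene wdt:P644 ?start .          # chromosomal start position (GRCh38)
    ?gene wdt:P645 ?end .            # chromosomal end position (GRCh38)
    # restrict the start position to the 21-30 megabase region;
    # xsd:integer() also covers string-typed position values
    FILTER(xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }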

Source publication
Article
Full-text available
Open biological data are distributed over many resources making them challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and dissemin...

Similar publications

Preprint
Full-text available
Open biological data is distributed over many resources making it challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and disseminati...
Thesis
Full-text available
The Web of Data offers an environment for sharing and disseminating data, within a particular framework that allows data to be exploited by both humans and machines. To this end, the RDF framework proposes formatting data as elementary sentences of the form (subject, relation, object), called triples. The databases of the Web of...
Conference Paper
Full-text available
The evolution of the traditional Web towards the semantic Web allows the machine to be a first-order citizen on the Web and increases discoverability of and accessibility to the unstructured data on the Web. This evolution enables Linked Data technologies to be used as background knowledge bases for unstructured data, notably texts, available...
Chapter
Full-text available
In the current digital world, where all data is available digitally, data extraction has become a major field of research. Web mining uses data mining techniques to discover knowledge from the Web. Such techniques can be very useful in applications which require massive and up-to-date data. One such application that can...

Citations

... So it can grow and become a new topic-specific KG ...". For example, in the case of extracting a life science subset of Wikidata, the extracted subset can be considered a life science knowledge graph, which can subsequently be enriched with additional triples, creating a new Life Science KG based on the Wikidata data model and enriched with content from other sources. ...
... This project is one of the most active WikiProjects in terms of human and bot contributions [4]. The project was initiated based on a class-level diagram of the Wikidata knowledge graph for biomedical entities, which specifies 17 main classes [8]. The Wikidata WikiProject has since extended these into 24 item classes. ...
Article
Full-text available
Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.
... This model of statements is common to linked data repositories aligned to the Semantic Web [14][15][16], and Wikidata extends it with qualifiers and references that enable capturing specific detail and provenance (see Tip 7). For example, the statement Retinoic acid receptor alpha (Q254943) physically interacts with (P129) tretinoin (Q29417), with the role (P2868) of agonist (Q389934) cites as a reference that it is stated in (P248) the IUPHAR/BPS database (Q17091219). ...
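As a hedged illustration of how such a qualified and referenced statement can be read back out of Wikidata, the following sketch for the Wikidata Query Service walks the statement node for P129 on Q254943 and optionally retrieves the role qualifier and the stated-in reference (the p:, ps:, pq:, prov: and pr: prefixes are predefined on that endpoint):

  SELECT ?interactant ?role ?source WHERE {
    wd:Q254943 p:P129 ?stmt .                        # statement node for "physically interacts with"
    ?stmt ps:P129 ?interactant .                     # main value, e.g. tretinoin (Q29417)
    OPTIONAL { ?stmt pq:P2868 ?role . }              # qualifier: role, e.g. agonist (Q389934)
    OPTIONAL { ?stmt prov:wasDerivedFrom/pr:P248 ?source . }  # reference: stated in, e.g. IUPHAR/BPS (Q17091219)
  }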
... For example, the Wikidata Integrator library can update items based on external resources and then confirm data consistency via SPARQL queries. It is used by multiple Python bots to keep biology topics up to date, such as genes, diseases, and drugs (ProteinBoxBot) [14], or cell lines (CellosaurusBot) [26]. Once you start to add statements to an item (especially instance of/subclass of), the interface will begin to suggest common properties to add, based on what other similar items include. ...
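A minimal sketch of the kind of consistency query such bots can run. The modeling is an assumption: gene items taken as instance of (P31) gene (Q7187, a class also mentioned elsewhere on this page) found in taxon (P703) Homo sapiens (Q15978631), with Entrez Gene IDs under P351; it lists human gene items lacking an Entrez Gene ID, which a bot like ProteinBoxBot could then fill in from the external source.

  SELECT ?gene WHERE {
    ?gene wdt:P31 wd:Q7187 ;         # instance of gene (assumed modeling)
          wdt:P703 wd:Q15978631 .    # found in taxon: Homo sapiens
    FILTER NOT EXISTS { ?gene wdt:P351 ?entrezId . }   # no Entrez Gene ID present
  }
  LIMIT 100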
... This means that some level of inconsistency and incompleteness in its contents is currently inevitable [8,[29][30][31]. There is thorough coverage of some items, such as protein classes [32], human genes [14], cell types [26,33], and metabolic pathways [34]. However, this is not true across all topics, and inconsistencies fall into a few categories (Box 2). ...
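Coverage of the kind described above can be gauged directly on the public endpoint. The sketch below counts items per class for two of the biomedical classes mentioned on this page (gene Q7187 and protein Q8054, chosen only as examples); note that such aggregate queries can hit the public endpoint's timeout for very large classes.

  SELECT ?class (COUNT(?item) AS ?items) WHERE {
    VALUES ?class { wd:Q7187 wd:Q8054 }   # gene, protein
    ?item wdt:P31 ?class .                # instance of
  }
  GROUP BY ?class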
... The Wikidata GeneWiki [4] project aims to use Wikidata as a semantic framework to manage and disseminate biomedical data. To that end, it describes a knowledge graph about such entities and their relationships. ...
Preprint
Full-text available
Shape Expressions (ShEx) are used in various fields of knowledge to define RDF graph structures. ShEx visualizations enable all kinds of users to better comprehend the underlying schemas and perceive its properties. Nevertheless, the only antecedent (RDFShape) suffers from limited scalability which impairs comprehension in large cases. In this work, a visual notation for ShEx is defined which is built upon operationalized principles for cognitively efficient design. Furthermore, two approaches to said notation with complexity management mechanisms are implemented: a 2D diagram (Shumlex) and a 3D Graph (3DShEx). A comparative user evaluation between both approaches and RDFShape was performed. Results show that Shumlex users were significantly faster than 3DShEx users in large schemas. Even though no significant differences were observed for success rates and precision, only Shumlex achieved a perfect score in both. Moreover, while users' ratings were mostly positive for all tools, their feedback was mostly favourable towards Shumlex. By contrast, RDFShape and 3DShEx's scalability is widely criticised. Given those results, it is concluded that Shumlex may have potential as a cognitively efficient visualization of ShEx. In contrast, the more intricate interaction with a 3D environment appears to hinder 3DShEx users.
... More specifically, a review of the literature illustrated that researchers use Wikidata to conduct new types of research (Amaral et al., 2021; Colla et al., 2021; Ferradji & Benchikha, 2021; Good et al., 2016; Kaffee, 2016; Konieczny & Klein, 2018; Lemus-Rojas & Odell, 2018; Li et al., 2022; Meier, 2022; Mietchen et al., 2015; Morshed, 2021; Neelam et al., 2022; Rasberry & Mietchen, 2021; Shenoy et al., 2022; Taveekarn et al., 2019; Waagmeester et al., 2020, 2021; Zhang et al., 2022). Researchers also use Wikidata to conduct new types of academic analysis in a variety of disciplines (Arnaout et al., 2021; Burgstaller-Muehlbacher et al., 2016; Kaffee et al., 2017; Klein et al., 2016; Lemus-Rojas, n.d.; Pfundner et al., 2015; Putman et al., 2017; Rutz et al., 2021; Scharpf et al., 2021a, b; Turki et al., 2019, 2022a). Finally, at times researchers use Wikidata to demonstrate new types of visualizations (Hernández et al., 2016; Metilli et al., 2019; Nielsen et al., 2017; Nielsen, 2016a, b). ...
Article
Full-text available
Wikidata is a free, multilingual, open knowledge base that stores structured, linked data. It has grown rapidly and as of December 2022 contains over 100 million items and millions of statements, making it the largest semantic knowledge base in existence. By changing the interaction between people and knowledge, Wikidata offers various learning opportunities, leading to new applications in science, technology and culture. These learning opportunities stem in part from the ability to query this data and ask questions that were difficult to answer in the past. They also stem from the ability to visualize query results, for example on a timeline or a map, which, in turn, helps users make sense of the data and draw additional insights from it. Research on the Semantic Web as a learning platform and on Wikidata in the context of education is almost non-existent, and we are just beginning to understand how to utilize it for educational purposes. This research investigates the Semantic Web as a learning platform, focusing on Wikidata as a prime example. To that end, a methodology of multiple case studies was adopted, demonstrating Wikidata uses by early adopters. Seven semi-structured, in-depth interviews were conducted, out of which 10 distinct projects were extracted. A thematic analysis approach was deployed, revealing eight main uses, as well as benefits and challenges to engaging with the platform. The results shed light on Wikidata’s potential as a lifelong learning platform, enabling opportunities for improved Data Literacy and a worldwide social impact.
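The visualization this abstract alludes to is available directly in the Wikidata Query Service through view directives. The sketch below is purely illustrative (the hospital class Q16917 and the choice of subject are assumptions, not taken from the abstract); it renders the result set on a map using coordinate location (P625).

  #defaultView:Map
  SELECT ?item ?itemLabel ?coord WHERE {
    ?item wdt:P31 wd:Q16917 ;    # instance of hospital (illustrative class)
          wdt:P625 ?coord .      # coordinate location
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 500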
... The input structured data, taken from available data sources (DBPedia [2,4,21,23], Wikidata [3,4,7,21,23]), is organized in different formats, ranging from RDF triples [2,3,4,9] and tables [5,7,8] (Wikipedia infoboxes, slot-value pairs) to knowledge graphs [22,24] built from RDF triples. In our paper, we follow the works [1,27,28,29,30] in formatting Wikidata statements as a set of quads and triples, which can be transformed into RDF triples for creating knowledge graphs. ...
... Also, when using (29), there may be some variants of (28) where we must add 1 to avoid the problem of zero logarithms. To calculate the sum and product distances of a labeled sentence by TF or IDF, we apply the same formulas as (27) and (28), but in a much simpler form. ...
Preprint
Full-text available
Acknowledged as one of the most successful online cooperative projects in human society, Wikipedia has grown rapidly in recent years and continuously seeks to expand its content and disseminate knowledge to everyone globally. A shortage of volunteers causes many issues for Wikipedia, including developing content for its more than 300 language editions at present. Therefore, the benefit of machines automatically generating content to reduce human effort on Wikipedia language projects could be considerable. In this paper, we propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models. The results are helpful not only for the data-to-text generation task but also for other relevant works in the field.
... Wikidata currently serves as a semantic framework for a variety of scientific initiatives ranging from genetics (Burgstaller-Muehlbacher et al., 2016) to invasion biology (Jeschke et al., 2021) and clinical trials (Rasberry et al., 2022), allowing different teams of scholars, volunteers and others to integrate valuable academic data into a collective and standardized pool. Its versatility and interconnectedness make it an example of interdisciplinary data integration and dissemination across fields as diverse as linguistics, information technology, film studies, and medicine (Turki et al., 2019; Mitraka et al., 2015; Mietchen et al., 2015; Waagmeester, Schriml & Su, 2019; Turki et al., 2017; Wasi, Sachan & Darbari, 2020; Heftberger et al., 2020), including disease outbreaks like those caused by the Zika virus (Ekins et al., 2016) or SARS-CoV-2 (Turki et al., 2022). ...
Article
Full-text available
Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog an automatable task set necessary to assess and validate the portion of Wikidata relating to the COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and its applicability in collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to the features of other methods for the validation of linked web data as revealed by previous research.
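As a hedged example of such a validation task, the sketch below flags regional COVID-19 outbreak items whose reported number of deaths exceeds their number of cases, an obvious statistical inconsistency. The modeling is an assumption not taken from the abstract: outbreak items linked via part of (P361) to the COVID-19 pandemic item (Q81068910), with number of cases (P1603) and number of deaths (P1120).

  SELECT ?outbreak ?cases ?deaths WHERE {
    ?outbreak wdt:P361 wd:Q81068910 ;   # part of: COVID-19 pandemic (assumed modeling)
              wdt:P1603 ?cases ;        # number of cases
              wdt:P1120 ?deaths .       # number of deaths
    FILTER(?deaths > ?cases)            # deaths should never exceed cases
  }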
... Some consortia, such as the OBO Foundry, centralize the ontologies of the biomedical domain and try to make them interoperable [44]. Other ontologies are composed of varied concepts, such as the Life Sciences Linked Open Data and WikiDataGene projects [52,53]. ...
Thesis
The process of discovering new drugs is long, costly and very risky. The objective of this doctoral thesis is to improve the relevance of the primary phases of pharmaceutical research by developing computational methods. The first contribution concerns the development of the Pegasus knowledge graph in order to capitalize on the heterogeneous pharmaco-biological data of multiple provenances in the pharmaceutical sector. The industrial applications of Pegasus address the needs of therapeutic projects and make it possible to characterize off-target effects of perturbators, to design new experiments, and to identify focused screening libraries. The second contribution concerns the development of an algorithm for identifying positive control compounds and a normalization algorithm in order to improve the design and analysis of high-content phenotypic screening experiments. These algorithms make it possible to normalize the phenotypic signatures obtained from screening campaigns and to integrate informative phenotypic similarities into the Pegasus knowledge graph. The third contribution concerns the development of a mathematical model of the microtubule tyrosination cycle which, on the one hand, explains the inactivity in cells of chemical compounds shown to be active outside cells and, on the other hand, suggests the need to activate two reactions of this cycle in synergy to obtain an effect in cellular models. This illustrates the contribution of mathematical modeling both for predicting and understanding the counter-intuitive dynamics of biochemical processes, which cannot be represented by static knowledge graphs such as Pegasus, and for guiding the design of new screening experiments. The scientific contributions and industrial applications of this thesis were developed within the primary phases of drug discovery and are intended to extend to the clinical phases of the pharmaceutical process.
... On that basis, since the launch of Wikidata, [14] similar research has been undertaken on its quality, [15] including its medical content [16,17,18,19], multilingual aspects [20,21] and the potential for integration with research workflows. [22,23,24,25,1] The Wikimedia editorial community has created and maintains a governance process which uses consensus of participants as the basis of authority. [12] One of the community values is encouraging universal accessibility through language translation of content, so content develops beyond the original language. ...
Preprint
Full-text available
WikiProject Clinical Trials is a Wikidata community project to integrate clinical trials metadata with the Wikipedia ecosystem. Using Wikidata methods for data modeling, import, querying, curating, and profiling, the project brought ClinicalTrials.gov records into Wikidata and enriched them. The motivation for the project was gaining the benefits of hosting in Wikidata, which include distribution to new audiences and staging the content for the Wikimedia editor community to develop it further. Project pages present options for engaging with the content in the Wikidata environment. Example applications include generation of web-based profiles of clinical trials by medical condition, research intervention, research site, principal investigator, and funder. The project’s curation workflows including entity disambiguation and language translation could be expanded when there is a need to make subsets of clinical trial information more accessible to a given community. This project’s methods could be adapted for other clinical trial registries, or as a model for using Wikidata to enrich other metadata collections.
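A minimal sketch of the kind of profile query the project describes, under assumed modeling (trials as instance of clinical trial, Q30612, and funders via sponsor, P859, neither of which is stated in the abstract); it ranks sponsors by the number of trials recorded in Wikidata.

  SELECT ?sponsor ?sponsorLabel (COUNT(?trial) AS ?trials) WHERE {
    ?trial wdt:P31 wd:Q30612 ;    # instance of clinical trial (assumed class)
           wdt:P859 ?sponsor .    # sponsor / funder (assumed property)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  GROUP BY ?sponsor ?sponsorLabel
  ORDER BY DESC(?trials)
  LIMIT 20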
... A remarkable case is Google, which migrated its previous knowledge graph Freebase to Wikidata in 2017 [74]. Apart from Wikipedia, Wikidata has been reported to be used by external applications like Apple's Siri, and it has been adopted as the central hub for knowledge in several domains like life sciences [9], libraries [67] or social science [13]. As of August 2021, it contains information about more than 94 million entities, and since its launch there have been more than 1,400 million edits. ...
Preprint
Full-text available
The initial adoption of knowledge graphs by Google, and later by other big companies, has increased their popularity. In this paper we present a formal model for three different types of knowledge graphs which we call RDF-based graphs, property graphs and wikibase graphs. In order to increase the quality of knowledge graphs, several approaches have appeared to describe and validate their contents. Shape Expressions (ShEx) has been proposed as a concise language for RDF validation. We give a brief introduction to ShEx and present two extensions that can also be used to describe and validate property graphs (PShEx) and wikibase graphs (WShEx). One problem of knowledge graphs is the large amount of data they contain, which jeopardizes their practical application. In order to alleviate this problem, one approach is to create subsets of those knowledge graphs for some domains. We propose the following approaches to generate those subsets: entity matching, simple matching, ShEx matching, ShEx plus Slurp and ShEx plus Pregel, which are based on declaratively defining the subsets either by matching some content or by Shape Expressions. The last approach is based on a novel validation algorithm for ShEx based on the Pregel algorithm that can handle big data graphs and has been implemented on Apache Spark GraphX.
... The collaborative work in Wikidata to populate and curate this data has been largely accomplished by WikiProject COVID-19, 11 launched in March 2020 [105]. This WikiProject itself has a Wikidata item [Q87748614], and items are linked to it using the property on focus list of Wikimedia project [P5008]. ...
... Such distantly related entities are also available in other open knowledge graphs, particularly DBpedia and YAGO, and contribute much to the value of a semantic resource [30,84]. In Wikidata, several initiatives such as WikiCite for scholarly information [67,70,92,107] and Gene Wiki for genomic data [11] have enabled COVID-19 knowledge graphs to include classes like genes [Q7187], proteins [Q8054] or biological processes [Q2996394], along with the definition of semantic relations between items closely and distantly related to COVID-19. This, consequently, allows the expansion of the coverage of COVID-19 information in Wikidata and a better characterization of COVID-19-related items. ...
... These links make Wikidata a key node of the open data ecosystem, not only contributing its own items and internal links, but also bridging between other open databases ( Fig. 3). Wikidata therefore supports alignment between disparate knowledge bases and, consequently, semantic data integration [11] and federation [65] in the context of the linked open data cloud [21]. Such statements also permit the enrichment of Wikidata items with data from external databases when these resources are updated, particularly in relation with the regular changes of the multiple characteristics of COVID-19. ...
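Federation of the kind cited here can be sketched with a SERVICE clause on the Wikidata Query Service. The example below is an illustration only, assuming UniProt's public endpoint at https://sparql.uniprot.org/sparql and its up:Protein class; it joins Wikidata protein items (instance of protein, Q8054), via their UniProt protein ID (P352), with records in the external database. A real query would typically be more selective before federating; the LIMIT keeps the sketch small.

  SELECT ?protein ?uniprotEntry WHERE {
    ?protein wdt:P31 wd:Q8054 ;     # instance of protein
             wdt:P352 ?uniprotId .  # UniProt protein ID
    BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?uniprotEntry)
    SERVICE <https://sparql.uniprot.org/sparql> {
      ?uniprotEntry a <http://purl.uniprot.org/core/Protein> .   # confirm the record exists in UniProt
    }
  }
  LIMIT 10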
Article
Full-text available
Information related to the COVID-19 pandemic ranges from biological to bibliographic, from geographical to genetic and beyond. The structure of the raw data is highly complex, so converting it to meaningful insight requires data curation, integration, extraction and visualization, the global crowdsourcing of which provides both additional challenges and opportunities. Wikidata is an interdisciplinary, multilingual, open collaborative knowledge base of more than 90 million entities connected by well over a billion relationships. It acts as a web-scale platform for broader computer-supported cooperative work and linked open data, since it can be written to and queried in multiple ways in near real time by specialists, automated tools and the public. The main query language, SPARQL, is a semantic language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format. Here, we introduce four aspects of Wikidata that enable it to serve as a knowledge base for general information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The rich knowledge graph created for COVID-19 in Wikidata can be visualized, explored, and analyzed for purposes like decision support as well as educational and scholarly research.