An example SPARQL query, using the Wikidata SPARQL endpoint (query.wikidata.org). It retrieves all Wikidata (WD) items that are a subclass of protein-coding gene (Q840604), that have a chromosomal start position (P644) according to human genome build GRCh38, that reside on human chromosome (P659) 9 (Q20966585), and that have a chromosomal end position (P645) also on chromosome 9. Furthermore, the region of interest is restricted to chromosomal start positions between 21 and 30 megabase pairs. Colors: red indicates SPARQL commands, blue represents variable names, green represents URIs, and brown represents strings. Arrows point to the source code that each part of the description applies to.
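For readers without access to the figure, the query below is a minimal sketch of what the caption describes, written for the Wikidata Query Service (query.wikidata.org), where the wd:, wdt:, xsd:, wikibase: and bd: prefixes are predefined. The property and item identifiers are taken from the caption, except P279 ("subclass of"), which is an assumption; this is an illustrative reconstruction, not the figure's exact source code.

  SELECT ?gene ?geneLabel ?start ?end WHERE {
    ?gene wdt:P279 wd:Q840604 .      # subclass of protein-coding gene (P279 assumed)
    ?gene wdt:P659 wd:Q20966585 .    # resides on human chromosome 9 (per the caption)
    ?gene wdt:P644 ?start .          # chromosomal start position (GRCh38)
    ?gene wdt:P645 ?end .            # chromosomal end position (GRCh38)
    # restrict the start position to the 21-30 megabase region;
    # xsd:integer() also covers string-typed position values
    FILTER(xsd:integer(?start) > 21000000 && xsd:integer(?start) < 30000000)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }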

Source publication
Article
Full-text available
Open biological data are distributed over many resources making them challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and dissemin...

Similar publications

Preprint
Full-text available
Open biological data is distributed over many resources making it challenging to integrate, to update and to disseminate quickly. Wikidata is a growing, open community database which can serve this purpose and also provides tight integration with Wikipedia. In order to improve the state of biological data, facilitate data management and disseminati...
Thesis
Full-text available
The Web of Data offers an environment for sharing and disseminating data, within a particular framework that allows data to be exploited by both humans and machines. To this end, the RDF framework proposes formatting data as elementary sentences of the form (subject, relation, object), called triples. The databases of the Web of...
Conference Paper
Full-text available
The evolution of the traditional Web towards the semantic Web allows the machine to be a first-order citizen on the Web and increases discoverability of and accessibility to the unstructured data on the Web. This evolution enables Linked Data technologies to be used as background knowledge bases for unstructured data, notably texts, available...
Chapter
Full-text available
In the current digital world, where all data is available digitally, data extraction has become a major field of research. Web mining uses data mining techniques to discover knowledge from the Web. Such techniques can be very useful in applications which require massive and up-to-date data. One such application that can...

Citations

... So it can grow and become a new topic-specific KG ...". For example, in the case of extracting a life science subset of Wikidata, the extracted subset can be considered a life science knowledge graph, which can subsequently be enriched with additional triples, creating a new Life Science KG based on the Wikidata data model and enriched with content from other sources. ...
... This project is one of the most active WikiProjects in terms of human and bot contributions [4]. The project was initiated based on a class-level diagram of the Wikidata knowledge graph for biomedical entities, which specifies 17 main classes [8]. The Wikidata WikiProject has since extended these into 24 item classes. ...
Article
Full-text available
Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.
... This model of statements is common to linked data repositories aligned to the Semantic Web [14][15][16], and Wikidata extends it with qualifiers and references that enable capturing specific detail and provenance (see Tip 7). For example, the statement Retinoic acid receptor alpha (Q254943) physically interacts with (P129) tretinoin (Q29417), with the role (P2868) of agonist (Q389934) cites as a reference that it is stated in (P248) the IUPHAR/BPS database (Q17091219). ...
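As a hedged illustration of how such a qualified and referenced statement can be read back out of Wikidata, the following sketch for the Wikidata Query Service walks the statement node for P129 on Q254943 and optionally retrieves the role qualifier and the stated-in reference (the p:, ps:, pq:, prov: and pr: prefixes are predefined on that endpoint):

  SELECT ?interactant ?role ?source WHERE {
    wd:Q254943 p:P129 ?stmt .                        # statement node for "physically interacts with"
    ?stmt ps:P129 ?interactant .                     # main value, e.g. tretinoin (Q29417)
    OPTIONAL { ?stmt pq:P2868 ?role . }              # qualifier: role, e.g. agonist (Q389934)
    OPTIONAL { ?stmt prov:wasDerivedFrom/pr:P248 ?source . }  # reference: stated in, e.g. IUPHAR/BPS (Q17091219)
  }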
... For example, the Wikidata Integrator library can update items based on external resources and then confirm data consistency via SPARQL queries. It is used by multiple Python bots to keep biology topics up to date, such as genes, diseases, and drugs (ProteinBoxBot) [14], or cell lines (CellosaurusBot) [26]. Once you start to add statements to an item (especially instance of/subclass of), the interface will begin to suggest common properties to add, based on what other similar items include. ...
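A minimal sketch of the kind of consistency query such bots can run. The modeling is an assumption: gene items taken as instance of (P31) gene (Q7187, a class also mentioned elsewhere on this page) found in taxon (P703) Homo sapiens (Q15978631), with Entrez Gene IDs under P351; it lists human gene items lacking an Entrez Gene ID, which a bot like ProteinBoxBot could then fill in from the external source.

  SELECT ?gene WHERE {
    ?gene wdt:P31 wd:Q7187 ;         # instance of gene (assumed modeling)
          wdt:P703 wd:Q15978631 .    # found in taxon: Homo sapiens
    FILTER NOT EXISTS { ?gene wdt:P351 ?entrezId . }   # no Entrez Gene ID present
  }
  LIMIT 100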
... This means that some level of inconsistency and incompleteness in its contents is currently inevitable [8,[29][30][31]. There is thorough coverage of some items, such as protein classes [32], human genes [14], cell types [26,33], and metabolic pathways [34]. However, this is not true across all topics, and inconsistencies fall into a few categories (Box 2). ...
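Coverage of the kind described above can be gauged directly on the public endpoint. The sketch below counts items per class for two of the biomedical classes mentioned on this page (gene Q7187 and protein Q8054, chosen only as examples); note that such aggregate queries can hit the public endpoint's timeout for very large classes.

  SELECT ?class (COUNT(?item) AS ?items) WHERE {
    VALUES ?class { wd:Q7187 wd:Q8054 }   # gene, protein
    ?item wdt:P31 ?class .                # instance of
  }
  GROUP BY ?class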
... The Wikidata GeneWiki [4] project aims to use Wikidata as a semantic framework to manage and disseminate biomedical data. To that end, it describes a knowledge graph about such entities and their relationships. ...
Preprint
Full-text available
Shape Expressions (ShEx) are used in various fields of knowledge to define RDF graph structures. ShEx visualizations enable all kinds of users to better comprehend the underlying schemas and perceive its properties. Nevertheless, the only antecedent (RDFShape) suffers from limited scalability which impairs comprehension in large cases. In this work, a visual notation for ShEx is defined which is built upon operationalized principles for cognitively efficient design. Furthermore, two approaches to said notation with complexity management mechanisms are implemented: a 2D diagram (Shumlex) and a 3D Graph (3DShEx). A comparative user evaluation between both approaches and RDFShape was performed. Results show that Shumlex users were significantly faster than 3DShEx users in large schemas. Even though no significant differences were observed for success rates and precision, only Shumlex achieved a perfect score in both. Moreover, while users' ratings were mostly positive for all tools, their feedback was mostly favourable towards Shumlex. By contrast, RDFShape and 3DShEx's scalability is widely criticised. Given those results, it is concluded that Shumlex may have potential as a cognitively efficient visualization of ShEx. In contrast, the more intricate interaction with a 3D environment appears to hinder 3DShEx users.
... More specifically, a review of the literature illustrated that researchers use Wikidata to conduct new types of research (Amaral et al., 2021; Colla et al., 2021; Ferradji & Benchikha, 2021; Good et al., 2016; Kaffee, 2016; Konieczny & Klein, 2018; Lemus-Rojas & Odell, 2018; Li et al., 2022; Meier, 2022; Mietchen et al., 2015; Morshed, 2021; Neelam et al., 2022; Rasberry & Mietchen, 2021; Shenoy et al., 2022; Taveekarn et al., 2019; Waagmeester et al., 2020, 2021; Zhang et al., 2022). Researchers also use Wikidata to conduct new types of academic analysis in a variety of disciplines (Arnaout et al., 2021; Burgstaller-Muehlbacher et al., 2016; Kaffee et al., 2017; Klein et al., 2016; Lemus-Rojas, n.d.; Pfundner et al., 2015; Putman et al., 2017; Rutz et al., 2021; Scharpf et al., 2021a, b; Turki et al., 2019, 2022a). Finally, at times researchers use Wikidata to demonstrate new types of visualizations (Hernández et al., 2016; Metilli et al., 2019; Nielsen et al., 2017; Nielsen, 2016a, b). ...
Article
Full-text available
Wikidata is a free, multilingual, open knowledge base that stores structured, linked data. It has grown rapidly and as of December 2022 contains over 100 million items and millions of statements, making it the largest semantic knowledge base in existence. By changing the interaction between people and knowledge, Wikidata offers various learning opportunities, leading to new applications in science, technology and culture. These learning opportunities stem in part from the ability to query this data and ask questions that were difficult to answer in the past. They also stem from the ability to visualize query results, for example on a timeline or a map, which, in turn, helps users make sense of the data and draw additional insights from it. Research on the Semantic Web as a learning platform and on Wikidata in the context of education is almost non-existent, and we are just beginning to understand how to utilize it for educational purposes. This research investigates the Semantic Web as a learning platform, focusing on Wikidata as a prime example. To that end, a methodology of multiple case studies was adopted, demonstrating Wikidata uses by early adopters. Seven semi-structured, in-depth interviews were conducted, out of which 10 distinct projects were extracted. A thematic analysis approach was deployed, revealing eight main uses, as well as benefits and challenges to engaging with the platform. The results shed light on Wikidata’s potential as a lifelong learning platform, enabling opportunities for improved Data Literacy and a worldwide social impact.
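The visualization this abstract alludes to is available directly in the Wikidata Query Service through view directives. The sketch below is purely illustrative (the hospital class Q16917 and the choice of subject are assumptions, not taken from the abstract); it renders the result set on a map using coordinate location (P625).

  #defaultView:Map
  SELECT ?item ?itemLabel ?coord WHERE {
    ?item wdt:P31 wd:Q16917 ;    # instance of hospital (illustrative class)
          wdt:P625 ?coord .      # coordinate location
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  LIMIT 500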
... The input structured data, taken from available data sources (DBPedia [2,4,21,23], Wikidata [3,4,7,21,23]), is organized in different formats, ranging from RDF triples [2,3,4,9] and tables [5,7,8] (Wikipedia infoboxes, slot-value pairs) to knowledge graphs [22,24] built from RDF triples. In our paper, we follow the works [1,27,28,29,30] in formatting Wikidata statements as a set of quads and triples, which can be transformed into RDF triples for creating knowledge graphs. ...
... Also, when using (29), there may be some variants of (28) where we must add 1 to avoid the problem of zero logarithms. To calculate the sum and product distances of a labeled sentence by TF or IDF, we apply the same formulas as (27) and (28), but in a much simpler form. ...
Preprint
Full-text available
Acknowledged as one of the most successful online cooperative projects in human society, Wikipedia has grown rapidly in recent years and continuously seeks to expand its content and disseminate knowledge to everyone globally. A shortage of volunteers causes many issues for Wikipedia, including developing content for its more than 300 language editions at present. Therefore, the benefit of machines automatically generating content to reduce human effort on Wikipedia language projects could be considerable. In this paper, we propose our mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level. The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia. We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models. The results are helpful not only for the data-to-text generation task but also for other relevant works in the field.
... Wikidata currently serves as a semantic framework for a variety of scientific initiatives ranging from genetics (Burgstaller-Muehlbacher et al., 2016) to invasion biology (Jeschke et al., 2021) and clinical trials (Rasberry et al., 2022), allowing different teams of scholars, volunteers and others to integrate valuable academic data into a collective and standardized pool. Its versatility and interconnectedness make it an example of interdisciplinary data integration and dissemination across fields as diverse as linguistics, information technology, film studies, and medicine (Turki et al., 2019; Mitraka et al., 2015; Mietchen et al., 2015; Waagmeester, Schriml & Su, 2019; Turki et al., 2017; Wasi, Sachan & Darbari, 2020; Heftberger et al., 2020), including disease outbreaks like those caused by the Zika virus (Ekins et al., 2016) or SARS-CoV-2 (Turki et al., 2022). ...
Article
Full-text available
Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog an automatable task set necessary to assess and validate the portion of Wikidata relating to the COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and its applicability in collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to the features of other methods for the validation of linked web data as revealed by previous research.
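As a hedged example of such a validation task, the sketch below flags regional COVID-19 outbreak items whose reported number of deaths exceeds their number of cases, an obvious statistical inconsistency. The modeling is an assumption not taken from the abstract: outbreak items linked via part of (P361) to the COVID-19 pandemic item (Q81068910), with number of cases (P1603) and number of deaths (P1120).

  SELECT ?outbreak ?cases ?deaths WHERE {
    ?outbreak wdt:P361 wd:Q81068910 ;   # part of: COVID-19 pandemic (assumed modeling)
              wdt:P1603 ?cases ;        # number of cases
              wdt:P1120 ?deaths .       # number of deaths
    FILTER(?deaths > ?cases)            # deaths should never exceed cases
  }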
... Some consortia, such as the OBO Foundry, centralize the ontologies of the biomedical domain and try to make them interoperable [44]. Other ontologies are composed of varied concepts, such as the Life Sciences Linked Open Data and WikiDataGene projects [52,53]. ...
Thesis
The process of discovering new drugs is long, costly and very risky. The objective of this doctoral thesis is to improve the relevance of the primary phases of pharmaceutical research by developing computational methods. The first contribution concerns the development of the Pegasus knowledge graph in order to capitalize on the heterogeneous pharmaco-biological data of multiple provenances in the pharmaceutical sector. The industrial applications of Pegasus address the needs of therapeutic projects and make it possible to characterize off-target effects of perturbators, to design new experiments, and to identify focused screening libraries. The second contribution concerns the development of an algorithm for identifying positive control compounds and a normalization algorithm in order to improve the design and analysis of high-content phenotypic screening experiments. These algorithms make it possible to normalize the phenotypic signatures obtained from screening campaigns and to integrate informative phenotypic similarities into the Pegasus knowledge graph. The third contribution concerns the development of a mathematical model of the microtubule tyrosination cycle which, on the one hand, explains the inactivity in cells of chemical compounds shown to be active outside cells and, on the other hand, suggests the need to activate two reactions of this cycle in synergy to obtain an effect in cellular models. This illustrates the contribution of mathematical modeling both for predicting and understanding the counter-intuitive dynamics of biochemical processes, which cannot be represented by static knowledge graphs such as Pegasus, and for guiding the design of new screening experiments. The scientific contributions and industrial applications of this thesis were developed within the primary phases of drug discovery and are intended to extend to the clinical phases of the pharmaceutical process.
... On that basis, since the launch of Wikidata, [14] similar research has been undertaken on its quality, [15] including its medical content [16,17,18,19], multilingual aspects [20,21] and the potential for integration with research workflows. [22,23,24,25,1] The Wikimedia editorial community has created and maintains a governance process which uses consensus of participants as the basis of authority. [12] One of the community values is encouraging universal accessibility through language translation of content, so content develops beyond the original language. ...
Preprint
Full-text available
WikiProject Clinical Trials is a Wikidata community project to integrate clinical trials metadata with the Wikipedia ecosystem. Using Wikidata methods for data modeling, import, querying, curating, and profiling, the project brought ClinicalTrials.gov records into Wikidata and enriched them. The motivation for the project was gaining the benefits of hosting in Wikidata, which include distribution to new audiences and staging the content for the Wikimedia editor community to develop it further. Project pages present options for engaging with the content in the Wikidata environment. Example applications include generation of web-based profiles of clinical trials by medical condition, research intervention, research site, principal investigator, and funder. The project’s curation workflows including entity disambiguation and language translation could be expanded when there is a need to make subsets of clinical trial information more accessible to a given community. This project’s methods could be adapted for other clinical trial registries, or as a model for using Wikidata to enrich other metadata collections.
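A minimal sketch of the kind of profile query the project describes, under assumed modeling (trials as instance of clinical trial, Q30612, and funders via sponsor, P859, neither of which is stated in the abstract); it ranks sponsors by the number of trials recorded in Wikidata.

  SELECT ?sponsor ?sponsorLabel (COUNT(?trial) AS ?trials) WHERE {
    ?trial wdt:P31 wd:Q30612 ;    # instance of clinical trial (assumed class)
           wdt:P859 ?sponsor .    # sponsor / funder (assumed property)
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }
  GROUP BY ?sponsor ?sponsorLabel
  ORDER BY DESC(?trials)
  LIMIT 20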
... A remarkable case is Google, which migrated its previous knowledge graph Freebase to Wikidata in 2017 [74]. Apart from Wikipedia, Wikidata has been reported to be used by external applications like Apple's Siri, and it has been adopted as the central hub for knowledge in several domains like life sciences [9], libraries [67] or social science [13]. As of August 2021, it contains information about more than 94 million entities, and since its launch there have been more than 1,400 million edits. ...
Preprint
Full-text available
The initial adoption of knowledge graphs by Google, and later by other big companies, has increased their popularity. In this paper we present a formal model for three different types of knowledge graphs which we call RDF-based graphs, property graphs and wikibase graphs. In order to increase the quality of knowledge graphs, several approaches have appeared to describe and validate their contents. Shape Expressions (ShEx) has been proposed as a concise language for RDF validation. We give a brief introduction to ShEx and present two extensions that can also be used to describe and validate property graphs (PShEx) and wikibase graphs (WShEx). One problem of knowledge graphs is the large amount of data they contain, which jeopardizes their practical application. In order to alleviate this problem, one approach is to create subsets of those knowledge graphs for some domains. We propose the following approaches to generate those subsets: entity matching, simple matching, ShEx matching, ShEx plus Slurp and ShEx plus Pregel, which are based on declaratively defining the subsets either by matching some content or by Shape Expressions. The last approach is based on a novel validation algorithm for ShEx based on the Pregel algorithm that can handle big data graphs and has been implemented on Apache Spark GraphX.
... The collaborative work in Wikidata to populate and curate this data has been largely accomplished by WikiProject COVID-19, 11 launched in March 2020 [105]. This WikiProject itself has a Wikidata item [Q87748614], and items are linked to it using the property on focus list of Wikimedia project [P5008]. ...
... Such distantly related entities are also available in other open knowledge graphs, particularly DBpedia and YAGO, and contribute much to the value of a semantic resource [30,84]. In Wikidata, several initiatives such as WikiCite for scholarly information [67,70,92,107] and Gene Wiki for genomic data [11] have enabled COVID-19 knowledge graphs to include classes like genes [Q7187], proteins [Q8054] or biological processes [Q2996394], along with the definition of semantic relations between items closely and distantly related to COVID-19. This, consequently, allows the expansion of the coverage of COVID-19 information in Wikidata and a better characterization of COVID-19-related items. ...
... These links make Wikidata a key node of the open data ecosystem, not only contributing its own items and internal links, but also bridging between other open databases ( Fig. 3). Wikidata therefore supports alignment between disparate knowledge bases and, consequently, semantic data integration [11] and federation [65] in the context of the linked open data cloud [21]. Such statements also permit the enrichment of Wikidata items with data from external databases when these resources are updated, particularly in relation with the regular changes of the multiple characteristics of COVID-19. ...
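Federation of the kind cited here can be sketched with a SERVICE clause on the Wikidata Query Service. The example below is an illustration only, assuming UniProt's public endpoint at https://sparql.uniprot.org/sparql and its up:Protein class; it joins Wikidata protein items (instance of protein, Q8054), via their UniProt protein ID (P352), with records in the external database. A real query would typically be more selective before federating; the LIMIT keeps the sketch small.

  SELECT ?protein ?uniprotEntry WHERE {
    ?protein wdt:P31 wd:Q8054 ;     # instance of protein
             wdt:P352 ?uniprotId .  # UniProt protein ID
    BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?uniprotEntry)
    SERVICE <https://sparql.uniprot.org/sparql> {
      ?uniprotEntry a <http://purl.uniprot.org/core/Protein> .   # confirm the record exists in UniProt
    }
  }
  LIMIT 10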
Article
Full-text available
Information related to the COVID-19 pandemic ranges from biological to bibliographic, from geographical to genetic and beyond. The structure of the raw data is highly complex, so converting it to meaningful insight requires data curation, integration, extraction and visualization, the global crowdsourcing of which provides both additional challenges and opportunities. Wikidata is an interdisciplinary, multilingual, open collaborative knowledge base of more than 90 million entities connected by well over a billion relationships. It acts as a web-scale platform for broader computer-supported cooperative work and linked open data, since it can be written to and queried in multiple ways in near real time by specialists, automated tools and the public. The main query language, SPARQL, is a semantic language used to retrieve and process information from databases saved in Resource Description Framework (RDF) format. Here, we introduce four aspects of Wikidata that enable it to serve as a knowledge base for general information on the COVID-19 pandemic: its flexible data model, its multilingual features, its alignment to multiple external databases, and its multidisciplinary organization. The rich knowledge graph created for COVID-19 in Wikidata can be visualized, explored, and analyzed for purposes like decision support as well as educational and scholarly research.