Article

Hotness prediction of scientific topics based on a bibliographic knowledge graph


Abstract

As a part of innovation forecasting, scientific topic hotness prediction plays an essential role in dynamic scientific topic assessment and domain knowledge transformation modeling. To improve topic hotness prediction performance, we propose an innovative model that estimates the co-evolution of scientific topics and bibliographic entities, leveraging a novel dynamic Bibliographic Knowledge Graph (BKG). Topic hotness can then be predicted from various kinds of topological entity information, i.e., TopicRank, PaperRank, AuthorRank, and VenueRank, along with pre-trained node embeddings, i.e., node2vec embeddings, and different pooling techniques. To validate the proposed method, we constructed a new BKG from 4.5 million PubMed Central publications plus the MeSH (Medical Subject Headings) thesaurus and observed substantial prediction improvements in extensive experiments spanning 10 years of observations.
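As a rough illustration of the pipeline this abstract describes, the sketch below combines rank-style features with pooled node2vec embeddings and trains a regressor on past hotness. All names, the pooling choices, and the gradient-boosting regressor are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming rank features plus pooled node2vec embeddings.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def topic_features(topic_rank, neighbor_ranks, neighbor_embeddings):
    """Build one feature vector for a topic node.

    neighbor_ranks      -- ranks of linked papers/authors/venues (list of float)
    neighbor_embeddings -- node2vec vectors of the topic's neighborhood (n x d)
    """
    pooled_mean = neighbor_embeddings.mean(axis=0)  # mean pooling
    pooled_max = neighbor_embeddings.max(axis=0)    # max pooling
    return np.concatenate([[topic_rank, np.mean(neighbor_ranks)],
                           pooled_mean, pooled_max])

# X: one row per (topic, year); y: observed hotness in the following year
X = np.vstack([topic_features(0.8, [0.5, 0.7], np.random.rand(10, 64)),
               topic_features(0.3, [0.2, 0.4], np.random.rand(10, 64))])
y = np.array([120.0, 35.0])  # e.g., next-year mention counts (illustrative)
model = GradientBoostingRegressor().fit(X, y)
```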


... These subgraphs can be used to identify past topic evolution and predict future topic evolution based on structural and temporal features of topic structures. The dynamic Bibliographic Knowledge Graph (BKG) [35] results from merging bibliographic entities (PMC) with a biomedical and life sciences thesaurus (MeSH). Using different types of topic, paper, author and venue rankings, together with pre-trained node embeddings and various pooling techniques, it can be applied to predict topic hotness. ...
Chapter
Full-text available
This paper presents ATEM, a novel framework for studying topic evolution in scientific archives. ATEM employs dynamic topic modeling and dynamic graph embedding to explore the dynamics of content and citations within a scientific corpus. ATEM explores a new notion of citation context that uncovers emerging topics by analyzing the dynamics of citation links between evolving topics. Our experiments demonstrate that ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP archive of over five million computer science articles.
Preprint
This paper presents ATEM, a novel framework for studying topic evolution in scientific archives. ATEM is based on dynamic topic modeling and dynamic graph embedding techniques that explore the dynamics of content and citations of documents within a scientific corpus. ATEM explores a new notion of contextual emergence for the discovery of emerging interdisciplinary research topics based on the dynamics of citation links in topic clusters. Our experiments show that ATEM can efficiently detect emerging cross-disciplinary topics within the DBLP archive of over five million computer science articles.
... As a new paradigm of knowledge processing, the development of knowledge graphs brings new possibilities for UAV knowledge management [3]. Knowledge graphs offer powerful capabilities for synthesizing and governing data: large-scale, multi-source UAV knowledge data in different forms can be deeply mined to represent the semantic relations and knowledge systems the graph contains. ...
Article
Full-text available
Accurate target recognition of unmanned aerial vehicles (UAVs) in the intelligent warfare mode relies on a highly standardized UAV knowledge base, and thus it is crucial to construct a knowledge graph suitable for UAV multi-source information fusion. However, due to a lack of domain knowledge and cumbersome, inefficient construction techniques, approaches to the intelligent construction of knowledge graphs for UAVs lag behind. To this end, this paper proposes a framework for the construction and application of a standardized knowledge graph from large-scale UAV unstructured data. First, UAV concept classes and relations are defined to form a specialized ontology, and UAV knowledge extraction triples are labeled. Then, a two-stage knowledge extraction model based on relational attention-based contextual semantic representation (UASR) is designed around the characteristics of the UAV knowledge extraction corpus. The contextual semantic representation is applied to the downstream task as a key feature through a Multilayer Perceptron (MLP) attention method, while a relation attention mechanism is used to calculate the relation-aware contextual representation in the subject–object entity extraction stage. Extensive experiments were carried out on the final annotated dataset, and the model reached an F1 score of 70.23%. On this basis, the UAV knowledge graph is presented visually, laying the foundation for back-end applications of UAV knowledge graph intelligent construction technology.
... Behind these strategies and tools, there are various fundamental scientific NLP tasks and datasets support. Identifying the multi-granularity function of a keyword [42], a sentence [28], or a citation [7] in the scientific paper is critical for downstream tasks, such as impact prediction [22,50,75], novelty measurement [44] and emerging topic prediction [25,35]. ...
Preprint
Fine-tuning pre-trained language models (PLMs), e.g., SciBERT, generally requires large amounts of annotated data to achieve state-of-the-art performance on a range of NLP tasks in the scientific domain. However, obtaining fine-tuning data for scientific NLP tasks is still challenging and expensive. Inspired by recent advances in prompt learning, in this paper we propose Mix Prompt Tuning (MPT), a semi-supervised method that alleviates the dependence on annotated data and improves the performance of multi-granularity academic function recognition tasks with a small number of labeled examples. Specifically, the proposed method provides multi-perspective representations by combining manual prompt templates with automatically learned continuous prompt templates to help the given academic function recognition task take full advantage of the knowledge in PLMs. Based on these prompt templates and the fine-tuned PLM, a large number of pseudo labels are assigned to the unlabeled examples. Finally, we fine-tune the PLM using the pseudo training set. We evaluate our method on three academic function recognition tasks of different granularity, including citation function, abstract sentence function, and keyword function, with datasets from the computer science and biomedical domains. Extensive experiments demonstrate the effectiveness of our method, with statistically significant improvements over strong baselines. In particular, it achieves an average increase of 5% in Macro-F1 score compared with fine-tuning, and 6% in Macro-F1 score compared with other semi-supervised methods under low-resource settings. In addition, MPT is a general method that can be easily applied to other low-resource scientific classification tasks.
Article
Full-text available
To analyze the impact of urban socioeconomic and demographic factors on fire occurrences and to predict fire occurrences from these factors, this study analyzed the correlation between "Korean social indicators" and fire occurrences. Based on this, a fire prediction model was built on a multi-layer perceptron (MLP). For this purpose, data on social indicators and the number of fires by city, county, and district from 2015 to 2022 were collected, and the correlation between social indicators and fire occurrences was analyzed. Based on the correlation analysis results, two models were built to predict fires using 15 factors (Model 1) and 5 factors (Model 2). The mean absolute percentage errors of the models were 26.37% (Model 1) and 30.92% (Model 2), confirming the usability of an MLP-based fire prediction model using social indicators.
Article
The paper presents scientifically grounded principles for the formation, and algorithms, of a global bibliographic retrieval system based on modern intelligent technologies (intelligent linguistic analysis of texts and database management) and Internet search methods. Within the framework of theoretical bibliography, and for the first time taking into account the specifics of modern intelligent software, a new algorithm was developed for an integrated automated system for global bibliographic retrieval, as well as for finding specialized publications in which to publish research papers. The developed methods include the formation of a normalized database of keywords based on the Universal Decimal Classification (UDC) categories and the automatic assignment of keywords from this normalized database to research papers by an intelligent system for linguistic analysis of texts. The proposed use of keywords as data labels in the form of UDC categories, covering all branches of knowledge, makes it possible to label and describe research papers and scientific publications (journals, proceedings of scientific conferences) precisely and as fully as possible. The search for research papers by the proposed bibliographic retrieval system, as well as the search for specialized publications in which to publish those papers, is carried out automatically using normalized keywords assigned to research papers and scientific journals through intelligent linguistic analysis of their content. The study makes a social contribution by showing how important it is to create and utilize best practices, processes and strategies of bibliographic retrieval, which are essential to the development of science in modern society. Moreover, in social terms, the search for bibliographic information is one way to relate personal and public knowledge. Implementing the study results in accordance with the presented algorithms amounts to designing a special global public Internet search resource.
Article
Machine understanding and thinking require prior knowledge consisting of explicit and implicit knowledge. Current knowledge bases contain various kinds of explicit knowledge but not implicit knowledge. As part of implicit knowledge, the typical characteristics of the things referred to by a concept can be obtained through concept cognition for knowledge graphs. Therefore, this paper attempts to realize concept cognition for knowledge graphs from the perspective of mining multigranularity decision rules. Specifically, (1) we propose a novel multigranularity three-way decision model that merges the ideas of multigranularity (i.e., from coarse granularity to fine granularity) and three-way decision (i.e., acceptance, rejection, and deferred decision). (2) Based on the multigranularity three-way decision model, an algorithm for mining multigranularity decision rules is proposed. (3) The monotonicity of the positive or negative granule space ensures that the positive (or negative) granule space from a coarser granularity does not need to participate in the three-way classification process at a finer granularity, which accelerates the mining of multigranularity decision rules. Moreover, the experimental results show that multigranularity decision rules outperform two-way decision rules, frequent decision rules and single-granularity decision rules, and that the monotonicity of the positive or negative granule space accelerates the mining process.
Article
In response to the exponential growth of the volume of scientific publications, researchers have proposed a multitude of information extraction methods for extracting entities and relations, such as task, dataset, metric, and method entities. However, the existing methods cannot directly provide readers with procedural scientific information that demonstrates the path to the problem's solution. From the perspective of applied science, we propose a novel schema for the applied AI community, namely a metric-driven mechanism schema (Operation, Effect, Direction). Our schema depicts the procedural scientific information concerning “How to optimize the quantitative metrics for a specific task?” In this paper, we choose papers in the domain of NLP for our study, which is a representative branch of Artificial Intelligence (AI). Specifically, we first construct a dataset that covers the metric-driven mechanisms in single and multiple sentences. Then we propose a framework for extracting metric-driven mechanisms, which includes three sub-models: 1) a mechanism detection model, 2) a query-guided seq2seq mechanism extraction model, and 3) a task recognition model. Finally, a metric-driven mechanism knowledge graph, named MKG_NLP, is constructed. Our MKG_NLP has over 43K n-ary mechanism relations in the form of (Operation, Effect, Direction, Task). The human evaluation shows that the extracted metric-driven mechanisms in MKG_NLP achieve 81.4% accuracy. Our model also shows the potential for creating applications to assist applied AI scientists to solve specific problems.
Article
Full-text available
The computer science discipline includes many research fields, which mutually influence and promote each other's development. This poses two great challenges for predicting the research topics of each field. One is how to model a fine-grained topic representation of a research field. The other is how to model the research topics of different fields while keeping them semantically consistent when learning the scientific influence context from other related fields. Unfortunately, existing research topic prediction approaches cannot handle these two challenges. To solve these problems, we employ multiple different Recurrent Neural Network chains to model the research topics of different fields, and propose a research topic prediction model based on spatial attention and semantic consistency-based scientific influence modeling. Spatial attention is employed in the field topic representation to selectively extract attributes from the field topics and distinguish the importance of field topic attributes. Semantic consistency-based scientific influence modeling maps the research topics of different fields to a unified semantic space to obtain the scientific influence context of other related fields. Extensive experimental results on five related research fields in the computer science (CS) discipline show that the proposed model is superior to the most advanced methods and achieves good topic prediction performance.
Article
Full-text available
The prediction of exceptional or surprising growth in research is an issue with deep roots and few practical solutions. In this study, we develop and validate a novel approach to forecasting growth in highly specific research communities. Each research community is represented by a cluster of papers. Multiple indicators were tested, and a composite indicator was created that predicts which research communities will experience exceptional growth over the next three years. The accuracy of this predictor was tested using hundreds of thousands of community-level forecasts and was found to exceed the performance benchmarks established in Intelligence Advanced Research Projects Activity's (IARPA) Foresight Using Scientific Exposition (FUSE) program in six of nine major fields in science. Furthermore, 10 of 11 disciplines within the Computing Technologies field met the benchmarks. Specific detailed forecast examples are given and evaluated, and a critical evaluation of the forecasting approach is also provided.
Article
Full-text available
PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
Article
Full-text available
Disease prediction supports disease prevention and early diagnosis, and accurate classification of patients greatly improves the accuracy of disease prediction. Today's massive multi-dimensional medical data and the similarity algorithms built on them provide a basis for the classification of clinical diseases. On this basis, we randomly generated simulated clinical data with an ICD-10 structure, used an improved similarity algorithm to calculate the pairwise similarity between patients and classify them, and identified patients belonging to different disease categories within the classified patient groups. This finding provides a scientific basis for the correction of genetic algorithms and for genetic research.
Article
Full-text available
Recently, a number of similarity-based methods have been proposed for link prediction in complex networks. Among these indices, resource-allocation-based prediction methods perform very well by considering the amount of resources involved in the information transmission process between nodes. However, they ignore the information channels, and their information capacity, in the information transmission process between two endpoints. Motivated by the Cannikin Law, a definition of information capacity is proposed to quantify the information transmission capability between any two nodes. Then, based on the information capacity, a potential information capacity (PIC) index is proposed for link prediction. An empirical study on 15 datasets shows that the proposed PIC index achieves good performance compared with eight mainstream baselines.
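For context, the resource-allocation family that PIC extends has a standard implementation in networkx; the minimal baseline sketch below reproduces only that baseline, not the PIC refinement itself.

```python
# RA score for a candidate link (u, v): sum over common neighbors w of 1/deg(w).
import networkx as nx

G = nx.karate_club_graph()
for u, v, score in nx.resource_allocation_index(G, [(0, 33), (5, 10)]):
    print(f"RA({u},{v}) = {score:.3f}")
```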
Article
Full-text available
Despite persistent efforts in understanding the creativity of scientists over different career stages, little is known about the underlying dynamics of research topic switching that drives innovation. Here, we analyze the publication records of individual scientists, aiming to quantify their topic switching dynamics and its influence. We find that the co-citing network of a scientist's papers exhibits a clear community structure, where each major community represents a research topic. Our analysis suggests that scientists have a narrow distribution of the number of topics. However, researchers nowadays switch between topics more frequently than those in the early days. We also find that a high switching probability early in a career is associated with low overall productivity, yet with high overall productivity later in a career. Interestingly, the average citation count per paper is, at all career stages, negatively correlated with the switching probability. We propose a model that can explain the main observed features.
Article
Full-text available
The increasingly available large-scale bibliographic data that generate a heterogeneous network provide opportunities to detect, track and predict the evolution of science. Recently, many efforts have been devoted to quantifying the impact of scientific papers within different citation time windows. However, the complex patterns of the citation network make it difficult to predict future citations on the basis of a short time window. Accordingly, we present a data-centric methodology to predict long-term scientific impact by combining numerous bibliographic features with a convolutional neural network. More specifically, we first expand the input features from the annual citation records to the features of the whole heterogeneous bibliographic information network, which completely represents the topology of academic activities. Then, a convolutional neural network model is designed to capture the complex nonlinear relationships between early network features and the final cumulative citation count. Last, we conduct an experiment on papers on Markov chains from 1980 to 1985. The results show the prediction performance can be improved by 5% over baseline models under the same problem definition and with the same dataset. Meanwhile, long-term scientific impact is strongly correlated with early recognition by authoritative authors or venues.
Article
Full-text available
As social networks play an increasingly important role in people's lives, people are more likely to discuss hot topics on social networks. Predicting the spread of hot topics, known as topic propagation prediction, is an important task. Due to the unpredictability of the users and topics in social networks, predicting the topic propagation trend is still a major challenge. Different users play different roles in topic propagation, yet existing studies have not made use of user role analysis. In this paper, we propose a topic propagation prediction method (TPP) based on user role analysis and a dynamic probability model. First, we describe our user role analysis, which incorporates four user factors to characterize user attributes along two dimensions. Second, we combine the dynamic probability model with user role analysis to accurately predict the topic propagation trend. Finally, we demonstrate the efficiency of TPP through experiments.
Article
Full-text available
Background: Improving the efficiency of disease diagnosis based on phenotype ontologies is a critical yet challenging research area. Recently, Human Phenotype Ontology (HPO)-based semantic similarity has been effectively and widely used to identify causative genes and diseases. However, current phenotype similarity measurements only consider the annotations and hierarchy structure of the HPO, neglecting the definition descriptions of phenotype terms. Results: In this paper, we propose a novel phenotype similarity measurement, termed DisPheno, which adequately incorporates the definitions of phenotype terms, in addition to HPO structure and annotations, to measure the similarity between phenotype terms. DisPheno also integrates phenotype term associations into phenotype-set similarity measurement using gene and disease annotations of phenotype terms. Conclusions: Compared with five existing state-of-the-art methods, DisPheno shows great performance in HPO-based phenotype semantic similarity measurement and improves the efficiency of disease identification, especially on noisy patient datasets.
Article
Full-text available
Measuring drug-drug similarity is important but challenging. Significant progress has been made for drugs whose labeled training data are sufficient and available. However, handling data skewness and incompleteness with a domain-specific knowledge graph is still a relatively new territory and an under-explored prospect. In this paper, we present KGDDS, a system for node-link-based biomedical knowledge graph curation and visualization that aids drug-drug similarity measurement. Specifically, we reuse existing knowledge bases to alleviate the difficulties in building a high-quality knowledge graph, ranging in size up to 7 million edges. We then design a prediction model to explore pharmacology features and knowledge graph features. Finally, we propose a user interaction model that allows users to better understand drug properties from a drug similarity perspective and to gain insights that are not easily observable in individual drugs. Visual demonstrations and experimental results indicate that KGDDS can bridge the user/caregiver gap by facilitating antibiotic prescription knowledge, and has remarkable applicability, outperforming existing state-of-the-art drug similarity measures.
Article
Full-text available
One of the most universal trends in science and technology today is the growth of large teams in all areas, as solitary researchers and small teams diminish in prevalence [1-3]. Increases in team size have been attributed to the specialization of scientific activities [3], improvements in communication technology [4,5], or the complexity of modern problems that require interdisciplinary solutions [6-8]. This shift in team size raises the question of whether and how the character of the science and technology produced by large teams differs from that of small teams. Here we analyse more than 65 million papers, patents and software products spanning the period 1954-2014, and demonstrate that across this period smaller teams have tended to disrupt science and technology with new ideas and opportunities, whereas larger teams have tended to develop existing ones. Work from larger teams builds on more-recent and popular developments, and attention to their work comes immediately. By contrast, contributions by smaller teams search more deeply into the past, are viewed as disruptive to science and technology and succeed further into the future, if at all. Observed differences between small and large teams are magnified for higher-impact work, with small teams known for disruptive work and large teams for developing work. Differences in topic and research design account for a small part of the relationship between team size and disruption; most of the effect occurs at the level of the individual, as people move between smaller and larger teams. These results demonstrate that both small and large teams are essential to a flourishing ecology of science and technology, and suggest that, to achieve this, science policies should aim to support a diversity of team sizes.
Article
Full-text available
The objective assessment of the prestige of an academic institution is a difficult and hotly debated task. In the last few years, different types of university rankings have been proposed to quantify it, yet the debate on what rankings are exactly measuring endures. To address the issue, we measure a quantitative and reliable proxy of the academic reputation of a given institution and compare our findings with well-established impact indicators and academic rankings. Specifically, we study citation patterns among universities in five different Web of Science Subject Categories and use the PageRank algorithm on the five resulting citation networks. The rationale behind our work is that scientific citations are driven by the reputation of the reference, so the PageRank algorithm is expected to yield a rank which reflects the reputation of an academic institution in a specific field. Given the volume of the data analysed, our findings are statistically sound and less prone to bias than, for instance, the ad hoc surveys often employed by ranking bodies to attain similar outcomes. The approach proposed in our paper may contribute to enhancing ranking methodologies by reconciling the qualitative evaluation of academic prestige with its quantitative measurement via publication impact.
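A minimal sketch of this ranking idea, assuming a toy weighted inter-university citation graph; the paper's Web of Science data are not reproduced here.

```python
import networkx as nx

# Edge A -> B means "papers from A cite papers from B"; weights are counts.
citations = nx.DiGraph()
citations.add_weighted_edges_from([
    ("Univ A", "Univ B", 120),
    ("Univ A", "Univ C", 30),
    ("Univ B", "Univ C", 80),
    ("Univ C", "Univ B", 40),
])
reputation = nx.pagerank(citations, alpha=0.85, weight="weight")
print(sorted(reputation.items(), key=lambda kv: -kv[1]))
```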
Article
Full-text available
The ability to predict the long-term impact of a scientific article soon after its publication is of great value towards accurate assessment of research performance. In this work we test the hypothesis that good predictions of long-term citation counts can be obtained through a combination of a publication's early citations and the impact factor of the hosting journal. The test is performed on a corpus of 123,128 WoS publications authored by Italian scientists, using linear regression models. The average accuracy of the prediction is good for citation time windows above two years, decreases for lowly-cited publications, and varies across disciplines. As expected, the role of the impact factor in the combination becomes negligible after only two years from publication.
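A minimal sketch of the tested regression, on synthetic data standing in for the WoS corpus; coefficients and distributions are illustrative only.

```python
# Regress long-term citations on early citations and journal impact factor.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
early_citations = rng.poisson(5, size=500)
impact_factor = rng.gamma(2.0, 1.5, size=500)
longterm = 3.2 * early_citations + 1.1 * impact_factor + rng.normal(0, 2, 500)

X = np.column_stack([early_citations, impact_factor])
model = LinearRegression().fit(X, longterm)
print(model.coef_, model.score(X, longterm))  # fitted weights and R^2
```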
Article
Full-text available
Background: Systems biology is an important field for understanding whole biological mechanisms composed of interactions between biological components. One approach for understanding complex and diverse mechanisms is to analyze biological pathways. However, because these pathways consist of important interactions and information on these interactions is disseminated in a large number of biomedical reports, text-mining techniques are essential for extracting these relationships automatically. Results: In this study, we applied node2vec, an algorithmic framework for feature learning in networks, for relationship extraction. To this end, we extracted genes from paper abstracts using pkde4j, a text-mining tool for detecting entities and relationships. Using the extracted genes, a co-occurrence network was constructed and node2vec was used with the network to generate a latent representation. To demonstrate the efficacy of node2vec in extracting relationships between genes, performance was evaluated for gene-gene interactions involved in a type 2 diabetes pathway. Moreover, we compared the results of node2vec to those of baseline methods such as co-occurrence and DeepWalk. Conclusions: Node2vec outperformed existing methods in detecting relationships in the type 2 diabetes pathway, demonstrating that this method is appropriate for capturing the relatedness between pairs of biological entities involved in biological pathways. The results demonstrated that node2vec is useful for automatic pathway construction.
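A compact sketch of this pipeline using the community node2vec package (pip install node2vec); the gene co-occurrence edges below are placeholders, not the paper's extracted network.

```python
import networkx as nx
from node2vec import Node2Vec

# Toy gene co-occurrence graph; weights = co-occurrence counts (illustrative).
G = nx.Graph()
G.add_weighted_edges_from([("INS", "IRS1", 12), ("IRS1", "PIK3CA", 7),
                           ("PIK3CA", "AKT1", 9), ("INS", "AKT1", 3)])

n2v = Node2Vec(G, dimensions=32, walk_length=10, num_walks=50, p=1, q=1)
model = n2v.fit(window=5, min_count=1)     # gensim Word2Vec under the hood
print(model.wv.similarity("INS", "IRS1"))  # embedding-based relatedness
```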
Article
Full-text available
Background: Comparing and classifying functions of gene products are important in today’s biomedical research. The semantic similarity derived from the Gene Ontology (GO) annotation has been regarded as one of the most widely used indicators for protein interaction. Among the various approaches proposed, those based on the vector space model are relatively simple, but their effectiveness is far from satisfying. Results: We propose a Hierarchical Vector Space Model (HVSM) for computing semantic similarity between different genes or their products, which enhances the basic vector space model by introducing the relation between GO terms. Besides the directly annotated terms, HVSM also takes their ancestors and descendants related by “is_a” and “part_of” relations into account. Moreover, HVSM introduces the concept of a Certainty Factor to calibrate the semantic similarity based on the number of terms annotated to genes. To assess the performance of our method, we applied HVSM to Homo sapiens and Saccharomyces cerevisiae protein-protein interaction datasets. Compared with TCSS, Resnik, and other classic similarity measures, HVSM achieved significant improvement for distinguishing positive from negative protein interactions. We also tested its correlation with sequence, EC, and Pfam similarity using online tool CESSM. Conclusions: HVSM showed an improvement of up to 4% compared to TCSS, 8% compared to IntelliGO, 12% compared to basic VSM, 6% compared to Resnik, 8% compared to Lin, 11% compared to Jiang, 8% compared to Schlicker, and 11% compared to SimGIC using AUC scores. CESSM test showed HVSM was comparable to SimGIC, and superior to all other similarity measures in CESSM as well as TCSS. Supplementary information and the software are available at https://github.com/kejia1215/HVSM.
Article
Full-text available
This paper presents the DIS-C approach, which is a novel method to assess the conceptual distance between concepts within an ontology. DIS-C is graph based in the sense that the whole topology of the ontology is considered when computing the weight of the relationships between concepts. The methodology is composed of two main steps. First, in order to take advantage of previous knowledge, an expert of the ontology domain assigns initial weight values to each of the relations in the ontology. Then, an automatic method for computing the conceptual relations refines the weights assigned to each relation until reaching a stable state. We introduce a metric called generality that is defined in order to evaluate the accessibility of each concept, considering the ontology like a strongly connected graph. Unlike most previous approaches, the DIS-C algorithm computes similarity between concepts in ontologies that are not necessarily represented in a hierarchical or taxonomic structure. So, DIS-C is capable of incorporating a wide variety of relationships between concepts such as meronymy, antonymy, functionality and causality.
Article
Full-text available
The whys and wherefores of SciSci: The science of science (SciSci) is based on a transdisciplinary approach that uses large data sets to study the mechanisms underlying the doing of science, from the choice of a research problem to career trajectories and progress within a field. In a Review, Fortunato et al. explain that the underlying rationale is that with a deeper understanding of the precursors of impactful science, it will be possible to develop systems and policies that improve each scientist's ability to succeed and enhance the prospects of science as a whole. Science, this issue p. eaao0185
Article
Full-text available
The science of science (SOS) is a rapidly developing field which aims to understand, quantify and predict scientific research and the resulting outcomes. The problem is essentially related to almost all scientific disciplines and thus has attracted attention of scholars from different backgrounds. Progress on SOS will lead to better solutions for many challenging issues, ranging from the selection of candidate faculty members by a university to the development of research fields to which a country should give priority. While different measurements have been designed to evaluate the scientific impact of scholars, journals and academic institutions, the multiplex structure, dynamics and evolution mechanisms of the whole system have been much less studied until recently. In this article, we review the recent advances in SOS, aiming to cover the topics from empirical study, network analysis, mechanistic models, ranking, prediction, and many important related issues. The results summarized in this review significantly deepen our understanding of the underlying mechanisms and statistical rules governing the science system. Finally, we review the forefront of SOS research and point out the specific difficulties as they arise from different contexts, so as to stimulate further efforts in this emerging interdisciplinary field.
Article
Full-text available
In this article some recent disputes about the usefulness of PageRank-based methods for the task of identifying influential researchers in citation networks are discussed. In particular, it focuses on the performance of these methods in relation to simple citation counts. With the aim of comparing these two classes of ranking methods, we analyze a large citation network of authors based on almost two million computer science papers and apply four PageRank-based and citations-based techniques to rank authors by importance throughout the period 1990–2014 on a yearly basis. We use ACM SIGMOD E. F. Codd Innovations Award and ACM A. M. Turing Award winners in our baseline lists of outstanding scientists and define four relevance weighting schemes with some predictive power for the ranking methods to increase the relevance of researchers winning in the future. We conclude that citations-based rankings perform better for Codd Award winners, but PageRank-based methods do so for Turing Award recipients when using absolute ranks and PageRank-based rankings outperform the citations-based techniques for both Codd and Turing Award laureates when relative ranks are considered. However, the two ranking groups show smaller differences if more weight is assigned to the relevance of future awardees.
Conference Paper
Full-text available
We study the problem of representation learning in heterogeneous networks. Its unique challenges come from the existence of multiple types of nodes and links, which limit the feasibility of the conventional network embedding techniques. We develop two scalable representation learning models, namely metapath2vec and metapath2vec++. The metapath2vec model formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to perform node embeddings. The metapath2vec++ model further enables the simultaneous modeling of structural and semantic correlations in heterogeneous networks. Extensive experiments show that metapath2vec and metapath2vec++ are able to not only outperform state-of-the-art embedding models in various heterogeneous network mining tasks, such as node classification, clustering, and similarity search, but also discern the structural and semantic correlations between diverse network objects.
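The core of metapath2vec is the meta-path-guided random walk; here is a minimal sketch, with the graph schema and node typing as illustrative assumptions (the skip-gram step is omitted).

```python
# At each step, only neighbors of the type required by the meta-path
# (e.g. Author-Paper-Venue-Paper-Author, "APVPA") are eligible.
import random
import networkx as nx

def metapath_walk(G, start, metapath, length):
    """G nodes carry a 'type' attribute; metapath like ['A','P','V','P']."""
    walk, node = [start], start
    for i in range(1, length):
        want = metapath[i % len(metapath)]
        nbrs = [n for n in G.neighbors(node) if G.nodes[n]["type"] == want]
        if not nbrs:
            break
        node = random.choice(nbrs)
        walk.append(node)
    return walk

G = nx.Graph()
G.add_nodes_from([("a1", {"type": "A"}), ("p1", {"type": "P"}),
                  ("v1", {"type": "V"}), ("p2", {"type": "P"}),
                  ("a2", {"type": "A"})])
G.add_edges_from([("a1", "p1"), ("p1", "v1"), ("v1", "p2"), ("p2", "a2")])
print(metapath_walk(G, "a1", ["A", "P", "V", "P"], 5))
```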
Article
Full-text available
Link prediction in complex networks has received a substantial amount of attention in the field of social network analysis. Though initial studies considered only a static snapshot of a network, the importance of the temporal dimension has subsequently been observed and cultivated. In recent times, multi-domain relationships between node pairs embedded in real networks have been exploited to boost link prediction performance. In this paper, we combine multi-domain topological features with the temporal dimension, and propose a robust and efficient feature set called TMLP (Time-aware Multi-relational Link Prediction) for link prediction in dynamic heterogeneous networks. It combines the dynamics of graph topology and the history of interactions at the dyadic level, and exploits a time-series model in the feature extraction process. Several experiments on two networks prepared from the DBLP bibliographic dataset show that the proposed framework significantly outperforms existing methods in predicting future links. It also demonstrates the necessity of combining heterogeneous information with the temporal dynamics of graph topology and dyadic history in order to predict future links. Empirical results find that the proposed feature set is robust against longitudinal bias.
Article
Full-text available
Prebiotics contribute to the well-being of their host by altering the composition of the gut microbiota. Discovering new prebiotics is a challenging and arduous task due to strict inclusion criteria; thus, highly limited numbers of prebiotic candidates have been identified. Notably, the large numbers of published studies may contain substantial information attached to various features of known prebiotics that can be used to predict new candidates. In this paper, we propose a medical subject headings (MeSH)-based text mining method for identifying new prebiotics with structured texts obtained from PubMed. We defined an optimal feature set for prebiotics prediction using a systematic feature-ranking algorithm with which a variety of carbohydrates can be accurately classified into different clusters in accordance with their chemical and biological attributes. The optimal feature set was used to separate positive prebiotics from other carbohydrates, and a cross-validation procedure was employed to assess the prediction accuracy of the model. Our method achieved a specificity of 0.876 and a sensitivity of 0.838. Finally, we identified a high-confidence list of candidates of prebiotics that are strongly supported by the literature. Our study demonstrates that text mining from high-volume biomedical literature is a promising approach in searching for new prebiotics.
Article
Full-text available
In this paper we show that assigning weights to the edges in a collaboration network of authors, according to a decreasing exponential function depending on the time elapsed since the publication of a common paper, may add valuable information to the process of ranking authors based on importance. The main idea is that a recent collaboration represents a stronger tie between the co-authors than an older one and, therefore, reduces the weight of potential citations between the co-authors. We test this approach, on a well-known data set and with an established methodology of using PageRank-based ranking techniques and reference sets of awarded authors and demonstrate that edge ageing may improve the ranking of authors.
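The edge-ageing idea reduces to a one-line weighting function; a sketch with an assumed decay constant (the paper's fitted value is not given here).

```python
# Weight a co-authorship edge by a decreasing exponential in the years
# since the last joint paper; decay=0.2 is an illustrative assumption.
import math

def edge_weight(current_year, last_joint_paper_year, decay=0.2):
    return math.exp(-decay * (current_year - last_joint_paper_year))

print(edge_weight(2024, 2023))  # recent collaboration -> weight near 1
print(edge_weight(2024, 2004))  # 20-year-old collaboration -> strongly damped
```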
Article
Full-text available
Topic-based ranking of authors, papers and journals can serve as a vital tool for identifying authorities on a given topic within a particular domain. Existing methods that measure topic-based scholarly output are limited to homogeneous networks. This study proposes a new informative metric called Topic-based Heterogeneous Rank (TH Rank), which measures the impact of a scholarly entity with respect to a given topic in a heterogeneous scholarly network containing authors, papers and journals. TH Rank calculates topic-dependent ranks for authors by considering the combined impact of the multiple factors which contribute to an author’s level of prestige. Information retrieval serves as the test field, and articles about information retrieval published between 1956 and 2014 were extracted from Web of Science. Initial results show that TH Rank can effectively identify the most prestigious authors, papers and journals related to a specific topic.
Article
Full-text available
Expert finding problem in bibliographic networks has received increased interest in recent years. This problem concerns finding relevant researchers for a given topic. Motivated by the observation that rarely do all coauthors contribute to a paper equally, in this paper, we propose two discriminative methods for realizing leading authors contributing in a scientific publication. Specifically, we cast the problem of expert finding in a bibliographic network to find leading experts in a research group, which is easier to solve. We recognize three feature groups that can discriminate relevant experts from other authors of a document. Experimental results on a real dataset, and a synthetic one that is gathered from a Microsoft academic search engine, show that the proposed model significantly improves the performance of expert finding in terms of all common information retrieval evaluation metrics.
Article
Predicting the development trend of future scientific research not only provides a reference for researchers to understand the development of the discipline, but also provides support for decision-making and fund allocation for decision-makers. The continuous growth of scientific publications has made it challenging to track the development trends of scientific research topics. Existing topic trend prediction methods have established that the research topic trend of a publication is influenced by other peer publications. However, they ignore the fact that the research topics of different publications belong to different research topic spaces. Moreover, existing topic prediction methods do not fully consider the interactive influence among publications: the research topic of one publication affects the topics of other publications and is, in turn, influenced by them. In line with this, this paper proposes a scientific research topic trend prediction model based on multi-long short-term memory (multi-LSTM) and a Graph Convolutional Network. Specifically, multiple LSTMs are employed to map the research topics of different publications into their respective topic spaces. Then, a graph convolutional neural network is applied to learn the scientific influence context of each publication, so that the research topic of each publication not only integrates the influence of neighbor nodes but also considers the influence of the neighbors of those neighbors, thus more accurately fusing the scientific influence context of the research topics of peer publications. Experimental results on a dataset of scientific research papers in the fields of artificial intelligence and data mining demonstrate that the model improves prediction precision and achieves state-of-the-art research topic trend prediction compared with the other baseline models.
Article
Predicting emerging research topics is important to researchers and policymakers. In this study, we propose a two-step solution to the problem of emerging topic prediction. The first step forecasts the future popularity score, a novel indicator reflecting impact and growth, of candidate topics in a time-series manner. The second step selects novel topics from the candidates predicted to be popular in the first step. Terms with domain characteristics are used as candidate topics. Deep neural networks, specifically LSTM and NNAR, are applied with nine features of topics to predict the popularity score. We evaluated the models and five baselines on two datasets from two perspectives, i.e., the ability to (1) predict the correct indicator value and (2) reconstruct the optimal ranking order. Two types of training strategies were compared, including a global strategy that trains a model with all topics and two local strategies that train separate models with different groups of topics. Our results show that LSTM and NNAR outperform other models in predicting the value of the popularity score as measured by MAE and RMSE, while LightGBM is a competitive baseline in ranking the topics. The performance difference between the global and local strategies is not significant. Emerging topics predicted by our approach are compared with those found by other methods. A qualitative assessment of the nominated emerging topics suggests that topics nominated by machine learning methods are more alike than those nominated by the rule-based model. Some important topics are nominated according to a preliminary literature analysis. This study exploits the strengths of both machine learning and bibliometric indicator approaches for emerging topic prediction. Deep neural networks are applied where an objective optimization target can be defined and measured. The bibliometric indicator offers an efficient way to select novel topics from candidates. The hybrid approach shows promise in considering various characteristics of emerging topics when making predictions.
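A minimal sketch of the first step (time-series forecasting of the popularity score with an LSTM), using synthetic data and the nine-feature assumption from the abstract; the window size and layer sizes are illustrative guesses.

```python
import numpy as np
import tensorflow as tf

WINDOW, N_FEATURES = 5, 9        # 5-year history, 9 topic features (assumed window)
X = np.random.rand(256, WINDOW, N_FEATURES).astype("float32")
y = np.random.rand(256, 1).astype("float32")  # next-year popularity score

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_FEATURES)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mae")   # MAE matches the evaluation metric
model.fit(X, y, epochs=2, batch_size=32, verbose=0)
```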
Article
Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is their potential to generate many different results, carrying rather diverse semantic meanings. In order to generate the desired clustering, we propose to use a meta-path, a path that connects object types via a sequence of relations, to control clustering with distinct semantics. Nevertheless, it is easier for a user to provide a few examples (seeds) than a weighted combination of sophisticated meta-paths to specify her clustering preference. Thus, we propose to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance. Then the system learns the weight for each meta-path that is consistent with the clustering result implied by the guidance, and generates clusters under the learned weights of meta-paths. A probabilistic approach is proposed to solve the problem, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta-path weights mutually enhance each other. Our experiments with several clustering tasks in two real networks and one synthetic network demonstrate the power of the algorithm in comparison with the baselines.
Book
This book describes methods and tools that empower information providers to build and maintain knowledge graphs, including those for manual, semi-automatic, and automatic construction; implementation; and validation and verification of semantic annotations and their integration into knowledge graphs. It also presents lifecycle-based approaches for semi-automatic and automatic curation of these graphs, such as approaches for assessment, error correction, and enrichment of knowledge graphs with other static and dynamic resources. Chapter 1 defines knowledge graphs, focusing on the impact of various approaches rather than mathematical precision. Chapter 2 details how knowledge graphs are built, implemented, maintained, and deployed. Chapter 3 then introduces relevant application layers that can be built on top of such knowledge graphs, and explains how inference can be used to define views on such graphs, making it a useful resource for open and service-oriented dialog systems. Chapter 4 discusses applications of knowledge graph technologies for e-tourism and use cases for other verticals. Lastly, Chapter 5 provides a summary and sketches directions for future work. The additional appendix introduces an abstract syntax and semantics for domain specifications that are used to adapt schema.org to specific domains and tasks. To illustrate the practical use of the approaches presented, the book discusses several pilots with a focus on conversational interfaces, describing how to exploit knowledge graphs for e-marketing and e-commerce. It is intended for advanced professionals and researchers requiring a brief introduction to knowledge graphs and their implementation.
Article
Online learning has been present since the early days of the Internet. As with any new technology, users look to make their lives easier and to save time. Experts in medical education are no different from other users: they want to use new technologies to their fullest. Medical educators have been challenged with keeping education interesting and up to date, while maximizing their resources. The challenges with any online educational program include being able to reach large numbers of learners, having content that is relevant and timely, and having it available through many different formats to suit the user. There are many examples of online learning programs in all fields of medicine and many specific to Allergy/Immunology. In this review, we describe a form of real-time videoconferencing referred to as Conferences On-Line Allergy (COLA), which was developed at Children's Mercy Hospital and Clinics. This program, which started as a once-a-month webinar, has grown into a well-known curriculum used by many Allergy/Immunology training programs across the United States. It provides not only live interactive conferences but also a library of recorded lectures and workshops that can be used at the learner's convenience. Taking advantage of the generosity of many volunteer presenters, it allows sharing of resources and provides benefits to the Allergy/Immunology community.
Article
Background: With the development of new magnetic resonance imaging (MRI) techniques, an increasing number of articles regarding hepatocellular carcinoma magnetic resonance imaging (HCCMRI) have been published in the past decade. However, few studies have statistically analyzed these published articles. In this study, we aim to systematically evaluate the scientific outcomes of HCCMRI research and explore the research hotspots from 2008 to 2017. Methods: Articles on HCCMRI research from 2008 to 2017 were downloaded from the Web of Science Core Collection and verified by two experienced radiologists. Excel 2016 was used to analyze the literature data, including the publication years and journals. CiteSpace V was used to perform co-occurrence analyses for authors, countries/regions and institutions and to generate the related collaboration network maps. Reference co-citation analysis (RCA) and burst keyword detection were also performed using CiteSpace V to explore the research hotspots of the past decade. Results: A total of 835 HCCMRI articles published from 2008 to 2017 were identified. Journal of Magnetic Resonance Imaging published the most articles (79 publications, 9.46%). Extensive cooperative relationships were observed among countries/regions and among authors. South Korea had the most publications (199 publications, 21.82%), followed by the United States of America (USA) (190 publications, 20.83%), Japan (162 publications, 17.76%), and the People's Republic of China (148 publications, 16.23%). Among the top 10 co-cited authors, Bruix J (398 citations) ranked first, followed by Llovet JM (235 citations), Kim YK (170 citations) and Forner A (152 citations). According to the RCA, ten major clusters were identified over the last decade; "LI-RADS data system" and "microvascular invasion" (MVI) were the two most recent clusters. Forty-seven burst keywords with the highest citation strength were detected over time. Of these keywords, "microvascular invasion" had the highest strength in the last 3 years. The LI-RADS has been constantly updated, with the latest edition released in July 2018. However, the LI-RADS still has limitations in identifying certain categories of lesions with its conceptual and non-quantitative probabilistic methods. Many questions remain to be answered, such as the differences in diagnostic efficiency among the major/ancillary imaging features. Preoperative prediction of MVI of HCC is very important to therapeutic decision-making. Some parameters of Gd-EOB-DTPA-enhanced MRI were found to be useful in predicting MVI, albeit with high specificity but very low sensitivity. A comprehensive predictive model incorporating both imaging and clinical variables may be preferable for predicting MVI of HCC. Conclusions: HCCMRI-related publications displayed a gradually increasing trend from 2008 to 2017. The USA holds a central position in collaborations with other countries/regions, while South Korea contributed the most publications. Of the ten major clusters identified in the RCA, the two most recent were "LI-RADS data system" and "microvascular invasion", indicative of the current HCCMRI research hotspots.
Article
Similarity is one of the most straightforward ways to relate objects and guide the human perception of the world. It plays an important role in many areas, such as Information Retrieval, Natural Language Processing, the Semantic Web and Recommender Systems. To help applications in these areas achieve satisfying results when finding similar concepts, it is important to simulate human perception of similarity and assess which similarity measure is the most adequate. We propose Sigmoid similarity, a feature-based semantic similarity measure on instances in a specific ontology, as an improvement of the Dice measure. We performed two separate evaluations with real evaluators. The first evaluation includes 137 subjects and 25 pairs of concepts in the recipes domain, and the second one includes 147 subjects and 30 pairs of concepts in the drinks domain. To the best of our knowledge, these are some of the most extensive evaluations in the field. We also explored the performance of some hierarchy-based approaches and showed that feature-based approaches outperform them on the two specific ontologies we tested. In addition, we tried to incorporate hierarchy-based information into our measures and concluded that it is not worth complicating the purely feature-based measures with additional information, since the two perform comparably.
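For reference, the Dice measure that Sigmoid similarity refines, plus one plausible sigmoid reweighting; the logistic form and its parameters below are illustrative guesses, not the authors' formula.

```python
import math

def dice(a: set, b: set) -> float:
    # Dice over feature sets: 2|A∩B| / (|A| + |B|)
    return 2 * len(a & b) / (len(a) + len(b))

def sigmoid_similarity(a: set, b: set, k: float = 6.0) -> float:
    # Squash Dice through a logistic curve centered at 0.5 (assumed form).
    return 1 / (1 + math.exp(-k * (dice(a, b) - 0.5)))

# Drinks-domain toy example, matching the paper's evaluation domain.
mojito = {"rum", "lime", "mint", "sugar", "soda"}
daiquiri = {"rum", "lime", "sugar"}
print(dice(mojito, daiquiri), sigmoid_similarity(mojito, daiquiri))
```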
Article
Link prediction is considered as one of the key tasks in various data mining applications for recommendation systems, bioinformatics, security and worldwide web. The majority of previous works in link prediction mainly focus on the homogeneous networks which only consider one type of node and link. However, real-world networks have heterogeneous interactions and complicated dynamic structure, which make link prediction a more challenging task. In this paper, we have studied the problem of link prediction in the dynamic, undirected, weighted/unweighted, heterogeneous social networks which are composed of multiple types of nodes and links that change over time. We propose a novel method, called Multivariate Time Series Link Prediction for evolving heterogeneous networks that incorporate (1) temporal evolution of the network; (2) correlations between link evolution and multi-typed relationships; (3) local and global similarity measures; and (4) node connectivity information. Our proposed method and the previously proposed time series methods are evaluated experimentally on a real-world bibliographic network (DBLP) and a social bookmarking network (Delicious). Experimental results show that the proposed method outperforms the previous methods in terms of AUC measures in different test cases.
Article
We propose measures of the impact of research that improve on existing ones such as counts of papers, citations, and the $h$-index. Since different papers and different fields have largely different average numbers of co-authors and of references, we replace citations with individual citations, shared among co-authors. Next, we improve on citation counting by applying the PageRank algorithm to citations among papers. Because citations are time-ordered, this reduces to a weighted count of citation descendants that we call PaperRank. Similarly, we compute an AuthorRank by applying the PageRank algorithm to citations among authors. These metrics quantify the impact of an author or paper by taking into account the impact of the authors who cite it. Finally, we show how self- and circular citations can be eliminated by defining a closed market of citation-coins. We apply these metrics to the InSpire database, which covers fundamental physics, ranking papers, authors, journals, institutes, towns, countries, continents, and genders, both all-time and over recent time periods.
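A rough stand-in for the PaperRank/AuthorRank idea can be sketched with off-the-shelf PageRank over a toy citation graph, sharing each paper's score equally among its co-authors in the spirit of the individual-citation idea; the graph, author lists, and equal-share weighting below are illustrative assumptions, not the paper's exact scheme.

```python
import networkx as nx

# Toy citation graph: an edge p -> q means paper p cites paper q.
citations = nx.DiGraph([
    ("p3", "p1"), ("p3", "p2"), ("p4", "p3"), ("p4", "p1"), ("p5", "p4"),
])

# PageRank accumulates score on heavily cited papers, a rough PaperRank.
paper_rank = nx.pagerank(citations, alpha=0.85)

# Share each paper's score equally among its co-authors to get a simple
# AuthorRank-style aggregate (author lists are illustrative).
authors = {"p1": ["A"], "p2": ["A", "B"], "p3": ["B"], "p4": ["C"], "p5": ["A", "C"]}
author_rank: dict[str, float] = {}
for paper, score in paper_rank.items():
    for a in authors[paper]:
        author_rank[a] = author_rank.get(a, 0.0) + score / len(authors[paper])
print(sorted(author_rank.items(), key=lambda kv: -kv[1]))
```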
Article
As network analysis methods prevail, more metrics are applied to co-word networks to reveal hot topics in a field. However, few studies have examined the relationships among these metrics. To bridge this gap, this study explores the relationships among different ranking metrics, one frequency-based and six network-based, in order to understand the impact of network structural features on ranking themes in co-word networks. We collected bibliographic data from three disciplines from the Web of Science (WoS) and generated 40 simulation networks following the preferential attachment assumption. Correlation analysis on the empirical and simulated networks shows strong relationships among the metrics, and these relationships are consistent across disciplines. The metrics can be categorized into three groups according to the strength of their correlations: Degree Centrality, H-index, and Coreness form one group; Betweenness Centrality, Clustering Coefficient, and frequency another; and Weighted PageRank a group by itself. Regression analysis on the simulation networks reveals that network topology properties, such as connectivity, sparsity, and aggregation, influence the relationships among the selected metrics. In addition, when comparing the top keywords ranked by the metrics in the three disciplines, we found that the metrics exhibit different discriminative capacities: Coreness and H-index may be better suited to categorizing keywords than to ranking them. Findings from this study contribute to a better understanding of the relationships among different metrics and provide guidance for using them effectively in different contexts.
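The correlation analysis can be illustrated in miniature: compute several of the named metrics on a toy co-word network and compare the keyword rankings they induce with Spearman's rho. The network below is invented and far smaller than the WoS data used in the study.

```python
import networkx as nx
from scipy.stats import spearmanr

# Toy co-word network: edge weight = co-occurrence frequency of two keywords.
g = nx.Graph()
g.add_weighted_edges_from([
    ("graph", "embedding", 5), ("graph", "topic", 3), ("topic", "model", 4),
    ("embedding", "model", 2), ("graph", "model", 1), ("model", "inference", 2),
])

keywords = list(g.nodes())
degree = dict(g.degree())                    # Degree Centrality
coreness = nx.core_number(g)                 # Coreness (k-core index)
betweenness = nx.betweenness_centrality(g)   # Betweenness Centrality
wpagerank = nx.pagerank(g, weight="weight")  # Weighted PageRank

# Spearman rank correlation between the keyword rankings induced by
# degree and by each of the other metrics, mirroring the paper's analysis.
for name, metric in [("coreness", coreness), ("betweenness", betweenness),
                     ("weighted PageRank", wpagerank)]:
    rho, _ = spearmanr([degree[k] for k in keywords],
                       [metric[k] for k in keywords])
    print(f"degree vs {name}: Spearman rho = {rho:.2f}")
```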
Article
The process of topic propagation always interweaves information diffusion and opinion evolution, but most previous work has modeled information diffusion and opinion evolution separately and seldom focused on their interaction. To shed light on the effect of users' opinion evolution on information diffusion in online social networks, we propose a model that incorporates opinion evolution into the process of topic propagation. Several real topics propagating on Sina Microblog were collected to analyze individuals' propagation intentions, and these different propagation intentions were incorporated into the model. Topic propagation was then simulated to explore the impact of different opinion distributions, and of intervention with an opposite opinion, on information diffusion. Results show that topics with one-sided opinions spread faster and more widely, and that intervention with an opposite opinion is an effective measure for guiding topic propagation: the earlier the intervention, the more effectively the propagation can be guided.
Conference Paper
In this paper, we propose a novel representation learning framework, namely HIN2Vec, for heterogeneous information networks (HINs). The core of the proposed framework is a neural network model, also called HIN2Vec, designed to capture the rich semantics embedded in HINs by exploiting the different types of relationships among nodes. Given a set of relationships specified as meta-paths in an HIN, HIN2Vec carries out multiple prediction training tasks jointly, based on a target set of relationships, to learn latent vectors of nodes and meta-paths in the HIN. Beyond model design, several issues unique to HIN2Vec, including regularization of meta-path vectors, node type selection in negative sampling, and cycles in random walks, are examined. To validate our ideas, we learn latent vectors of nodes using four large-scale real HIN datasets, including Blogcatalog, Yelp, DBLP, and U.S. Patents, and use them as features for multi-label node classification and link prediction on those networks. Empirical results show that HIN2Vec soundly outperforms state-of-the-art representation learning models for network data, including DeepWalk, LINE, node2vec, PTE, HINE, and ESim, by 6.6% to 23.8% in micro-F1 for multi-label node classification and by 5% to 70.8% in MAP for link prediction.
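A full HIN2Vec model is beyond a short sketch, but the meta-path-constrained walks on a typed graph that such models build on are easy to show; the toy HIN and the A-P-V-P-A meta-path below are assumptions for illustration, and the resulting walks would feed a skip-gram-style trainer rather than being useful on their own.

```python
import random
import networkx as nx

# Toy HIN: node attribute "t" stores the type (A=author, P=paper, V=venue).
g = nx.Graph()
g.add_nodes_from([("a1", {"t": "A"}), ("a2", {"t": "A"}),
                  ("p1", {"t": "P"}), ("p2", {"t": "P"}), ("v1", {"t": "V"})])
g.add_edges_from([("a1", "p1"), ("a2", "p1"), ("a2", "p2"), ("p1", "v1"), ("p2", "v1")])

def metapath_walk(g, start, metapath, rng=random):
    """One random walk constrained to follow a type sequence, e.g. A-P-V-P-A;
    returns early if no neighbour of the required type exists."""
    walk = [start]
    for next_type in metapath[1:]:
        candidates = [n for n in g.neighbors(walk[-1]) if g.nodes[n]["t"] == next_type]
        if not candidates:
            break
        walk.append(rng.choice(candidates))
    return walk

print(metapath_walk(g, "a1", ["A", "P", "V", "P", "A"]))
```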
Article
The desire to predict discoveries—to have some idea, in advance, of what will be discovered, by whom, when, and where—pervades nearly all aspects of modern science, from individual scientists to publishers, from funding agencies to hiring committees. In this Essay, we survey the emerging and interdisciplinary field of the “science of science” and what it teaches us about the predictability of scientific discovery. We then discuss future opportunities for improving predictions derived from the science of science and its potential impact, positive and negative, on the scientific community.
Conference Paper
In this paper, we study the problem of author identification under the double-blind review setting, which is to identify potential authors given information about an anonymized paper. Unlike existing approaches that rely heavily on feature engineering, we propose to use a network embedding approach, which automatically represents nodes as lower-dimensional feature vectors. However, recent studies on network embedding have two major limitations: (1) they are usually general-purpose embedding methods that are independent of the specific task; and (2) most of these approaches can only deal with homogeneous networks, ignoring the heterogeneity of the network. The challenges here are therefore twofold: (1) how to embed the network under the guidance of the author identification task, and (2) how to select the best types of information given the heterogeneity of the network. To address these challenges, we propose a task-guided and path-augmented heterogeneous network embedding model. In our model, nodes are first embedded as vectors in a latent feature space. Embeddings are then shared and jointly trained according to task-specific and network-general objectives. We extend existing unsupervised network embedding to incorporate meta-paths in heterogeneous networks and select paths according to the specific task. The guidance from the author identification task is provided both explicitly, in joint training, and implicitly, during meta-path selection. Our experiments demonstrate that by using path-augmented network embedding with task guidance, our model obtains significantly better accuracy at identifying the true authors compared to existing methods.
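The joint-training idea, shared embeddings optimized against a task-specific objective plus a network-general one, can be sketched as a combined loss; the dot-product scoring, softmax task loss, lambda trade-off, and all node names below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
# Shared embeddings for all node types (papers, authors, venues).
emb = {n: rng.normal(size=16) for n in ["paper1", "authorA", "authorB", "venue1"]}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_nll(u, v):
    """Network-general objective: observed edges should score high
    (negative log-likelihood under a dot-product edge model)."""
    return -np.log(sigmoid(emb[u] @ emb[v]))

def author_id_nll(paper, true_author, candidates):
    """Task-specific objective: the true author should outscore the
    other candidates for the anonymized paper (softmax over candidates)."""
    scores = np.array([emb[paper] @ emb[a] for a in candidates])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return -np.log(probs[candidates.index(true_author)])

lam = 0.5  # trade-off between the two objectives in joint training
total = author_id_nll("paper1", "authorA", ["authorA", "authorB"]) \
        + lam * edge_nll("paper1", "venue1")
print(f"joint objective: {total:.3f}")
```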
Article
This paper presents HBIN-LBD, a novel literature-based discovery (LBD) method that exploits the lexico-citation structures within heterogeneous bibliographic information network (HBIN) graphs. Unlike existing LBD methods, HBIN-LBD harnesses the meta-path features found in HBIN graphs to discover latent associations between scientific papers published in otherwise disconnected research areas. Further, this paper investigates the effects of incorporating semantic and topic modeling components into the proposed models. Using time-sliced historical bibliographic data, we demonstrate the performance of our method by reconstructing two LBD hypotheses: the Fish Oil and Raynaud's Syndrome hypothesis and the Migraine and Magnesium hypothesis. The proposed method predicts future co-citation links between research papers from these previously disconnected research areas with up to 88.86% accuracy and 0.89 F-measure.
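One way to picture the kind of meta-path features such a method exploits is to count length-2 meta-path instances (shared author, shared venue) between two papers and use the counts as link-prediction features; the graph and the particular meta-paths below are invented for illustration and are much simpler than HBIN-LBD's lexico-citation features.

```python
import networkx as nx

# Toy HBIN with typed nodes (P=paper, A=author, V=venue); edges link
# papers to their authors and venues (all IDs are illustrative).
g = nx.Graph()
node_type = {"p1": "P", "p2": "P", "a1": "A", "v1": "V"}
g.add_edges_from([("p1", "a1"), ("p2", "a1"), ("p1", "v1"), ("p2", "v1")])

def metapath_count(g, src, dst, mid_type):
    """Number of length-2 meta-path instances src-?-dst whose middle node
    has the given type, e.g. P-A-P (shared author)."""
    return sum(1 for n in nx.common_neighbors(g, src, dst)
               if node_type[n] == mid_type)

# Feature vector for the candidate link (p1, p2): one count per meta-path.
features = [metapath_count(g, "p1", "p2", t) for t in ("A", "V")]
print(features)  # [1, 1] -> one shared author, one shared venue
# These counts would feed a classifier predicting future co-citation links.
```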
Article
Altmetrics is an emergent research area in which social media is used as a source of metrics to assess scholarly impact. In the last few years, interest in altmetrics has grown, giving rise to many questions regarding their potential benefits and challenges. This paper aims to address some of these questions. First, we provide an overview of the altmetrics landscape, comparing tool features, social media data sources, and the social media events provided by altmetric aggregators. Second, we conduct a systematic review of the altmetrics literature; a total of 172 articles were analysed, revealing a steady rise in altmetrics research since 2011. Third, we analyse the results of over 80 studies from the altmetrics literature on two major research topics: cross-metric validation and coverage of altmetrics. Aggregated percentage coverage across studies of 11 data sources shows that Mendeley has the highest coverage, at about 59% across 15 studies. A meta-analysis across more than 40 cross-metric validation studies shows an overall weak correlation (ranging from 0.08 to 0.5) between altmetrics and citation counts, confirming that altmetrics do indeed measure a different kind of research impact, acting as a complement to, rather than a substitute for, traditional metrics. Finally, we highlight open challenges and issues facing altmetrics and discuss future research areas.
Conference Paper
Prediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by learning the features themselves. However, present feature learning approaches are not expressive enough to capture the diversity of connectivity patterns observed in networks. Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a low-dimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node's network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. We demonstrate the efficacy of node2vec over existing state-of-the-art techniques on multi-label classification and link prediction in several real-world networks from diverse domains. Taken together, our work represents a new way for efficiently learning state-of-the-art task-independent representations in complex networks.
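The heart of node2vec is its second-order biased random walk, controlled by the return parameter p and the in-out parameter q; a minimal sketch follows, omitting the alias-sampling optimization and the downstream skip-gram training that produce the actual embeddings.

```python
import random
import networkx as nx

def node2vec_step(g, prev, curr, p=1.0, q=2.0, rng=random):
    """One step of node2vec's second-order biased walk: returning to `prev`
    is weighted 1/p, moving to a neighbour of `prev` is weighted 1, and
    moving further away is weighted 1/q."""
    neighbors = list(g.neighbors(curr))
    weights = []
    for n in neighbors:
        if n == prev:
            weights.append(1.0 / p)      # return to the previous node
        elif g.has_edge(n, prev):
            weights.append(1.0)          # BFS-like: stay close to prev
        else:
            weights.append(1.0 / q)      # DFS-like: explore outward
    return rng.choices(neighbors, weights=weights, k=1)[0]

def node2vec_walk(g, start, length, p=1.0, q=2.0):
    walk = [start, random.choice(list(g.neighbors(start)))]
    while len(walk) < length:
        walk.append(node2vec_step(g, walk[-2], walk[-1], p, q))
    return walk

g = nx.karate_club_graph()
print(node2vec_walk(g, 0, 10))  # walks like these are fed to a skip-gram model
```

With q > 1 the walk behaves more like BFS and captures structural roles; with q < 1 it behaves more like DFS and captures community membership, which is the flexibility the abstract refers to.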
Article
The volume of the existing research literature is such that it can be difficult to find highly relevant information and to develop an understanding of how a scientific topic has evolved. Prior research on topic evolution has often leveraged refinements of Latent Dirichlet Allocation (LDA) to identify emerging topics. However, such methods do not answer the question of which studies contributed to the evolution of a topic. In this paper, we show that meta-paths over a heterogeneous bibliographic network (consisting of papers, authors, and venues) can be used to identify the network elements that made the greatest contributions to a topic. In particular, by adding derived edges that capture the contribution of papers, authors, and venues to a topic (using the PageRank algorithm), a restricted meta-path over the bibliographic network can confine the evolution of topics to the context of interest to a researcher. We use such restricted meta-paths to construct a topic evolution tree that provides researchers with a web-based visualization of the evolution of a scientific topic in their context of interest. Compared to baseline networks without restrictions, we find that restricted networks yield more useful topic evolution trees.
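The derived-edge construction can be sketched as: rank papers by PageRank within a topic's citation subgraph, then add paper-to-topic edges weighted by those scores so that restricted meta-paths can traverse them; the toy graph, the "topicX" node, and the subgraph-then-rank ordering below are illustrative assumptions about the approach.

```python
import networkx as nx

# Toy bibliographic network: an edge p -> q means paper p cites paper q.
cites = nx.DiGraph([("p2", "p1"), ("p3", "p1"), ("p3", "p2")])
topic_papers = ["p1", "p2", "p3"]  # papers tagged with the topic (illustrative)

# Rank the papers inside the topic's citation subgraph...
pr = nx.pagerank(cites.subgraph(topic_papers))

# ...and add derived paper -> topic edges whose weight records each paper's
# contribution, so restricted meta-paths can follow them.
bkg = nx.DiGraph(cites)
for paper, score in pr.items():
    bkg.add_edge(paper, "topicX", kind="contributes", weight=score)
print(sorted(bkg.edges(data=True)))
```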
Article
Understanding information propagation in online social networks is important in many practical applications and of great interest to many researchers. The challenge with existing propagation models lies in their requirements of a complete network structure and topic-dependent model parameters, and in their assumption that topics spread in isolation. In this paper, we study the characteristics of multi-topic information propagation based on data collected from Sina Weibo, one of the most popular microblogging services in China. We find that the daily total amount of user resources is finite and that users' attention transfers from one topic to another, which provides evidence of competition between multiple dynamic topics. Based on these empirical observations, we develop a competition-based multi-topic information propagation model that does not require the social network structure. The model builds on general mechanisms of resource competition, i.e. attracting and distracting users' attention, and considers the interactions of multiple topics. Simulation results show that the model can effectively produce topics with temporal popularity similar to the real data. The impact of the model parameters is also analysed: the topic arrival rate reflects the strength of competition, and topic fitness is significant in modelling small-scale topic propagation.
Conference Paper
The idea behind AuthorRank is that content created by more popular authors should rank higher than content created by less popular authors. This paper brings this idea into the analysis of scientific publications to test whether an optimized topical AuthorRank can replace or enhance topical PageRank for publication ranking. First, the PageRank with Priors (PRP) algorithm was employed to rank topic-based publications and authors. Second, the first author's reputation was used to generate an AuthorRank score. Additionally, a linear combination of topical AuthorRank and PageRank was compared with several baselines. Finally, as shown in our evaluation results, topical AuthorRank combined with topic-based PageRank outperforms the other baselines for publication ranking.
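PageRank with Priors corresponds to personalized PageRank, so the topical ranking and its linear combination with an AuthorRank-style score can be sketched with networkx; the prior masses, reputation scores, and alpha weight below are made-up illustrations, not the paper's tuned values.

```python
import networkx as nx

# Toy citation graph among topic-relevant papers (IDs are illustrative).
g = nx.DiGraph([("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3")])

# PageRank with Priors = personalized PageRank: the teleport mass is
# biased toward papers strongly associated with the topic.
topic_prior = {"p1": 0.4, "p2": 0.4, "p3": 0.1, "p4": 0.1}
topical_pagerank = nx.pagerank(g, personalization=topic_prior)

# A first-author reputation score standing in for topical AuthorRank
# (values are invented for illustration).
author_rank = {"p1": 0.9, "p2": 0.2, "p3": 0.5, "p4": 0.3}

# Linear combination of the two signals for the final publication ranking.
alpha = 0.7
combined = {p: alpha * topical_pagerank[p] + (1 - alpha) * author_rank[p] for p in g}
print(sorted(combined.items(), key=lambda kv: -kv[1]))
```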
Article
Academic literature has been growing at such a pace that it can be difficult to follow the progression of scientific achievements; hence the need for quantitative knowledge support systems to analyze the literature of a subject. In this article, we use network analysis tools to build a literature review of scientific documents published in the multidisciplinary field of Strategic Environmental Assessment (SEA). The proposed approach helps researchers build unbiased and comprehensive literature reviews. We collect information on 7662 SEA publications and build the SEA Bibliographic Network (SEABN) on the basic principle that two publications are interconnected if one cites the other. We apply network analysis at the macroscopic (network architecture), mesoscopic (subgraph), and microscopic (node) levels in order to i) verify what network structure characterizes the SEA literature, ii) identify the authors, disciplines, and journals contributing to the international discussion on SEA, and iii) scrutinize the most cited and important publications in the field. Results show that SEA is a multidisciplinary subject; the SEABN belongs to the class of real small-world networks, with a dominance of publications in Environmental Studies over a total of 12 scientific sectors. Christopher Wood, Olivia Bina, Matthew Cashmore, and Andrew Jordan are found to be the leading authors, while Environmental Impact Assessment Review is by far the scientific journal with the highest number of publications in SEA studies.