Fig. 12. Scalability of TRUTHFINDER with respect to the number of facts. (a) Time. (b) Memory.

Source publication
Article
Full-text available
The World Wide Web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the Web. Moreover, different websites often provide conflicting information on a subject, such as different specifications for the same product. In this paper, we propose a new problem, calle...

Similar publications

Conference Paper
Full-text available
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem ca...

Citations

... Therefore, Cheng et al. [24] adopted a reputation mechanism [29]-[31] in their TD scheme for MCS systems, replacing the weights of MUs with reputations. Moreover, Yin et al. [25] utilized the implications among conflicting data and designed a TD-based framework to identify incorrect and fake data. However, the effectiveness of TD methods is limited by the quantity of the available sensing data. ...
... For instance, Cheng et al. [24] presented a framework combining TD with a zero-knowledge proof in an MCS system to improve the accuracy of the ground truth produced for sensing data while preserving user privacy. Yin et al. [25] leveraged TD to evaluate the trustworthiness of web-based information sources and to identify incorrect and fake data. However, these traditional TD methods are often unsatisfactory in practice because the sensing data in MCS systems can be sparse. ...
... These implicit correlations remain in scenarios involving multiple sensing targets. Therefore, the data mutually influence their trustworthiness even if they are sensed across different periods and in different regions [25]. We denote this correlation between sensing data points as implication and provide the following definition: ...
Preprint
Mobile crowdsensing (MCS) has emerged as a prominent trend across various domains. However, ensuring the quality of the sensing data submitted by mobile users (MUs) remains a complex and challenging problem. To address this challenge, an advanced method is required to detect low-quality sensing data and identify malicious MUs that may disrupt the normal operations of an MCS system. Therefore, this article proposes a prediction- and reputation-based truth discovery (PRBTD) framework, which can separate low-quality data from high-quality data in sensing tasks. First, we apply a correlation-focused spatial-temporal transformer network to predict the ground truth of the input sensing data. Then, we extract the sensing errors of the data as features based on the prediction results to calculate the implications among the data. Finally, we design a reputation-based truth discovery (TD) module for identifying low-quality data with their implications. Given sensing data submitted by MUs, PRBTD can eliminate the data with heavy noise and identify malicious MUs with high accuracy. Extensive experimental results demonstrate that PRBTD outperforms the existing methods in terms of identification accuracy and data quality enhancement.
... Yin et al. proposed the truth discovery problem, which involves finding the most accurate description for each object from conflicting descriptions provided by multiple data sources [4]. The article proposes a single-truth discovery algorithm based on two assumptions regarding the truth discovery problem. ...
... Voting: a majority voting method that takes, as the true value, the value that occurs most often among the values declared by the data sources. Truth Finder [4]: uses an iterative approach to compute data source reliability and attribute truth values based on two basic assumptions of truth discovery. CATD [13]: improves the accuracy of data source reliability estimation by taking into account the long-tail phenomenon of the data and confidence intervals on reliability, and then derives the true values. ...
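For reference, here is a minimal Python sketch of the majority-voting baseline described in the excerpt above; the `claims` structure and the book example are illustrative assumptions, not data from the cited papers.

```python
# Minimal majority-voting baseline: for each object, pick the value that
# occurs most often among the values declared by the data sources.
from collections import Counter

def majority_vote(claims):
    """claims: dict mapping each object to the list of values declared for it."""
    return {obj: Counter(values).most_common(1)[0][0]
            for obj, values in claims.items()}

# Example: three stores disagree on a book's author.
claims = {"book-123": ["J. Smith", "J. Smith", "John Smith"]}
print(majority_vote(claims))  # {'book-123': 'J. Smith'}
```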
Article
Full-text available
As the volume of data continues to grow, it is common for data from the same source to span multiple domains. Incorporating domain segmentation can enhance the effectiveness of data fusion. This paper presents a multi-truth discovery algorithm that utilises statement value grouping and domain information richness. The data are first grouped based on their similarity, and the resulting groups replace the original data. Then, the reliability of each data source is calculated per domain. The truth values and the reliability of each source are iteratively calculated until the termination condition is met. Finally, the appropriate value is selected from the obtained dataset as the final result. Experiments conducted on real datasets demonstrate the algorithm's effectiveness.
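As a loose illustration of the statement-value grouping step mentioned in this abstract, the sketch below greedily merges string values whose similarity exceeds a threshold; the similarity measure and threshold are illustrative assumptions, not the paper's choices.

```python
# Greedy grouping of claimed values by string similarity: each group of
# near-duplicate values can then stand in for its members during fusion.
from difflib import SequenceMatcher

def group_values(values, threshold=0.8):
    groups = []                                   # each group is a list of similar values
    for v in values:
        for g in groups:
            if SequenceMatcher(None, v, g[0]).ratio() >= threshold:
                g.append(v)                       # close enough to the group's representative
                break
        else:
            groups.append([v])                    # start a new group
    return groups

print(group_values(["J. K. Rowling", "J.K. Rowling", "Rowling", "Jim Kay"]))
```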
... Iteration methods [6,7] alternately compute the confidence of facts and the trustworthiness of sources through an iterative procedure. Some methods additionally consider inter-value influence [8] and confidence interval estimation [9] to improve accuracy. ...
... Both iteration algorithms [6,8,14] and EM algorithms [31] belong to coordinate descent algorithms, but they differ in how source quality is estimated: the former use a linear/nonlinear update function, while the latter make inferences by maximizing the (lower bound of the logarithmic) likelihood of the observations over all sources' claims. ...
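For the EM-based view contrasted in the excerpt above, the following is a toy Python sketch under a deliberately simplified model (binary claims, one accuracy parameter per source, a global prior on fact truth); it is not the exact formulation of any of the cited algorithms.

```python
# Toy EM for truth discovery: alternate between (E-step) inferring how likely
# each fact is true given the current source accuracies, and (M-step)
# re-estimating source accuracies and the prior from those posteriors.
def em_truth_discovery(obs, n_iters=50):
    """obs: dict mapping (source, fact) -> 1 (source claims true) or 0 (claims false)."""
    sources = {s for s, _ in obs}
    facts = {f for _, f in obs}
    acc = {s: 0.8 for s in sources}          # initial per-source accuracy
    prior = 0.5                              # prior probability that a fact is true
    post = {}
    for _ in range(n_iters):
        # E-step: posterior probability that each fact is true.
        for f in facts:
            p_true, p_false = prior, 1.0 - prior
            for s in sources:
                if (s, f) in obs:
                    v = obs[(s, f)]
                    p_true *= acc[s] if v == 1 else 1.0 - acc[s]
                    p_false *= 1.0 - acc[s] if v == 1 else acc[s]
            post[f] = p_true / (p_true + p_false)
        # M-step: re-estimate source accuracies and the prior (clamped to avoid 0/1).
        for s in sources:
            claimed = [(f, v) for (src, f), v in obs.items() if src == s]
            est = sum(post[f] if v == 1 else 1.0 - post[f]
                      for f, v in claimed) / len(claimed)
            acc[s] = min(max(est, 0.01), 0.99)
        prior = min(max(sum(post.values()) / len(post), 0.01), 0.99)
    return post, acc

# Two sources assert a claim, a third disputes it.
obs = {("s1", "f1"): 1, ("s2", "f1"): 1, ("s3", "f1"): 0}
print(em_truth_discovery(obs))
```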
... • book-author dataset [6] contains 33,971 book-author records crawled from www.abebooks.com. Each record represents a store's claim on the author(s) of a book. ...
Article
Full-text available
Truth discovery is the fundamental technique for resolving conflicts between the information provided by different data sources by detecting the true values. Traditional methods assume that each data item has only one true value and therefore cannot deal with circumstances where one data item has multiple true values (i.e., multi-value truth). In this work, we target this new challenge and propose a generalized Bayesian framework that comprehensively incorporates the features of multi-value truth for accurate and efficient multi-source data integration. Specifically, we identify three key features of multi-value truth, called source-value mapping, differentiated mutual exclusion, and complicated source dependency, to better solve the problem: sources and values are aggregated based on their mapping to reduce the problem scale, the exclusive relations between values are quantified to reflect the effect of multi-truth, and a fine-grained copy detection method is proposed to deal with complicated source dependency. The data preference of the model is also incorporated for fast parameter configuration. Experimental results on real-world and large-scale synthetic datasets demonstrate the effectiveness of our approach, with less execution time and an average 5% higher F1 score compared to the latest method.
... • TruthFinder (Yin, Han, and Yu 2008) is one of the earliest methods for detecting fake news using an unsupervised approach. This method employs an iterative process to determine the veracity of news by assessing the credibility of the source websites of the news. ...
Article
With the rise of social media, the spread of fake news has become a significant concern, potentially misleading public perceptions and impacting social stability. Deep learning methods like CNNs, RNNs, and Transformer-based models such as BERT have enhanced fake news detection, but they primarily focus on content and do not consider the social context of news propagation. Graph-based techniques have incorporated the social context but are limited by the need for large labeled datasets. To address these challenges, this paper introduces GAMC, an unsupervised fake news detection technique using the Graph Autoencoder with Masking and Contrastive learning. By leveraging both the context and content of news propagation as self-supervised signals, our method reduces the dependency on labeled datasets. Specifically, GAMC begins by applying data augmentation to the original news propagation graphs. These augmented graphs are then encoded with a graph encoder and reconstructed via a graph decoder. Finally, a composite loss function that encompasses both reconstruction error and contrastive loss is designed. First, it ensures the model can effectively capture the latent features by minimizing the discrepancy between reconstructed and original graph representations. Second, it aligns the representations of augmented graphs that originate from the same source. Experiments on the real-world dataset validate the effectiveness of our method.
... Similar to [32], the quality of data processing on edge server e_j in task i, denoted a_i^j, can be expressed as Eq. (2). ...
Article
Full-text available
With the rapid development and growing popularity of the Internet of Things (IoT), edge-assisted crowdsensing has emerged as a new mode of data collection and data processing. In an edge-assisted crowdsensing system, a reasonable task allocation and pricing mechanism is urgently required to promote the active participation of every part of the system. However, existing mechanisms either do not consider the impact of data quality on participant profits or ignore some parts of the whole system. We therefore propose a novel task allocation and pricing mechanism based on the Stackelberg game model, considering all four parties (data requesters, the crowdsensing platform, edge servers and IoT sensors) in an edge-assisted crowdsensing system. Specifically, we decompose the problem into three game sub-problems and design our mechanism using KKT condition-based approaches, with the aim of maximising the benefits of each party in the crowdsensing system. We demonstrate mathematically that the Stackelberg equilibrium can be achieved in all three games, and validate its performance through simulation experiments.
... This can involve basic majority voting or more advanced methods (Carvalho and Larson, 2013; Dawid and Skene, 1979; de Marneffe et al., 2012; Gaunt et al., 2016; Pham et al., 2017; Tian et al., 2019; Warby et al., 2014), including inverse rank normalization (IRN) as discussed in this paper. Often, aggregation is also performed using probabilistic models from the crowdsourcing and truth-discovery literature (Bachrach et al., 2012; Chu et al., 2021; Dong et al., 2009; Gordon et al., 2022; Guan et al., 2018; Li et al., 2012; Rodrigues and Pereira, 2018; Wang et al., 2012; Yin et al., 2008; Zhao et al., 2012); see (Yan et al., 2014; Zheng et al., 2017) for surveys. However, evaluation is often based on point estimates, and the impact of annotator disagreement on evaluation is generally poorly understood (Gordon et al., 2021). ...
Preprint
Full-text available
For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is often not the case: the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences, such as overestimating future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty, which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images, where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty, and standard IRN-based evaluation severely overestimates performance without providing uncertainty estimates.
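As a loose illustration of replacing deterministic adjudication with posterior inference over class plausibilities, here is a minimal Python sketch assuming a simple Dirichlet-multinomial model with a single reliability weight; it is not the probabilistic IRN or Plackett-Luce model proposed in the abstract.

```python
# Posterior mean over classes for one item: annotations act as reliability-
# weighted pseudo-counts added to a symmetric Dirichlet(alpha) prior.
def posterior_plausibility(annotations, n_classes, reliability=1.0, alpha=1.0):
    counts = [alpha] * n_classes
    for label in annotations:
        counts[label] += reliability          # each annotation adds a weighted vote
    total = sum(counts)
    return [c / total for c in counts]        # distribution over classes, not a point label

# Three annotators label one image: two say class 0, one says class 2.
print(posterior_plausibility([0, 0, 2], n_classes=3, reliability=0.9))
```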
... Truth finding. Truth finding [4,9,14,27] is an effective tool used to handle uncertain data. More specifically, when a dataset is missing some information and the dataset owner does not have access to this information, they can ask sources questions (or queries) in order to complete the dataset. ...
... Early works on truth-finding algorithms [9,27] show that majority voting is not the best way to corroborate data when different sources provide conflicting information. Interestingly, further studies [4,14] show that no single truth-finding algorithm performs well in all scenarios and benchmarks; we therefore choose Cosine and 3-Estimates as representative examples of such algorithms. ...
Preprint
Full-text available
Federated knowledge discovery and data mining are challenged to assess the trustworthiness of data originating from autonomous sources while protecting confidentiality and privacy. Truth-finding algorithms help corroborate data from disagreeing sources. For each query it receives, a truth-finding algorithm predicts a truth value of the answer, possibly updating the trustworthiness factor of each source. Few works, however, address the issues of confidentiality and privacy. We devise and present a secure secret-sharing-based multi-party computation protocol for pseudo-equality tests, which are used in truth-finding algorithms to compute additions that depend on a condition. The protocol guarantees confidentiality of the data and privacy of the sources. We also present variants of truth-finding algorithms that make the computation faster when executed using secure multi-party computation. We empirically evaluate the performance of the proposed protocol on two state-of-the-art truth-finding algorithms, Cosine and 3-Estimates, and compare them with the baseline plain algorithms. The results confirm that the secret-sharing-based secure multi-party algorithms are as accurate as the corresponding baselines, up to the proposed numerical approximations, which significantly reduce the efficiency loss incurred.
... The objective is to minimize the overall weighted deviation between the truths and the multi-source observations, where each source is weighted by its reliability. In (Yin et al., 2008), the authors design a general framework for the Veracity problem and invent an algorithm called TRUTHFINDER, which utilizes the relationships between websites and their information, i.e., a website is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy websites. An iterative method is used to infer the trustworthiness of websites and the correctness of information from each other. ...
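To make the mutual reinforcement described in this excerpt concrete, below is a simplified Python sketch of that iterative scheme; it omits TRUTHFINDER's logarithmic transformation, implication weights, and dampening factor, and the data structures are illustrative assumptions.

```python
# Simplified mutual reinforcement: a fact gains confidence when trustworthy
# sources assert it, and a source gains trustworthiness when its facts are confident.
import math

def iterate_trust(claims, n_iters=20):
    """claims: dict mapping each source to the set of facts it asserts."""
    trust = {s: 0.8 for s in claims}                       # initial source trustworthiness
    facts = {f for fs in claims.values() for f in fs}
    conf = {f: 0.5 for f in facts}
    for _ in range(n_iters):
        # Fact confidence: probability that at least one asserting source is right.
        for f in facts:
            providers = [s for s, fs in claims.items() if f in fs]
            conf[f] = 1.0 - math.prod(1.0 - trust[s] for s in providers)
        # Source trustworthiness: average confidence of the facts it provides.
        for s, fs in claims.items():
            trust[s] = sum(conf[f] for f in fs) / len(fs)
    return trust, conf

claims = {"siteA": {"x=1"}, "siteB": {"x=1"}, "siteC": {"x=2"}}
print(iterate_trust(claims))
```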
... dresses the challenge of merging the facts of the same real-world entity into one single fact (Bleiholder and Naumann 2008). To achieve this goal, data fusion is concerned with solving attribute-level conflicts that can originate from disagreeing or poor-quality sources and schema-level heterogeneity. Most of the techniques that have been proposed (Yin et al. 2008; Berti-Équille et al. 2009; Dong et al. 2015) adopt a "truth discovery approach" and perform metadata- and instance-based conflict resolution. In this section, we show an example where probabilistic knowledge graphs are effectively used to model a data fusion setting where multiple and mutually dependent sources need to be harmonized. The ...
Article
We provide a framework for probabilistic reasoning in Vadalog-based Knowledge Graphs (KGs), satisfying the requirements of ontological reasoning: full recursion, powerful existential quantification, expression of inductive definitions. Vadalog is a Knowledge Representation and Reasoning (KRR) language based on Warded Datalog+/–, a logical core language of existential rules, with a good balance between computational complexity and expressive power. Handling uncertainty is essential for reasoning with KGs. Yet Vadalog and Warded Datalog+/– are not covered by the existing probabilistic logic programming and statistical relational learning approaches for several reasons, including insufficient support for recursion with existential quantification and the impossibility to express inductive definitions. In this work, we introduce Soft Vadalog, a probabilistic extension to Vadalog, satisfying these desiderata. A Soft Vadalog program induces what we call a Probabilistic Knowledge Graph (PKG), which consists of a probability distribution on a network of chase instances, structures obtained by grounding the rules over a database using the chase procedure. We exploit PKGs for probabilistic marginal inference. We discuss the theory and present MCMC-chase, a Monte Carlo method to use Soft Vadalog in practice. We apply our framework to solve data management and industrial problems and experimentally evaluate it in the Vadalog system.
... In recent times, an information-analytical technique called truth discovery (TD) is often employed in social sensing to deduce the veracity of claims by filtering trustworthy knowledge from social signals [40]. Examples of recent TD schemes include: (i) Expectation-maximization (EM), a maximum likelihood estimation approach to TD that considers human observations as binary variables to estimate event veracity [1]; (ii) Truth Finder, a probabilistic TD algorithm using iterative weight updates [41]; and (iii) Rumor Source Detector, a graph-based TD approach that identifies misleading sources using the spanning tree principle [42]. One key limitation of such TD solutions is that they solely rely on the information contained in the noisy social signals to estimate the veracity instead of employing any physical sensors (e.g., cameras mounted on UAVs) to validate the prediction accuracy and improve the estimation performance [4]. ...
Article
Social airborne sensing (SAS) is taking shape as a new integrated sensing paradigm that melds the human wisdom derived from social data platforms (e.g., Twitter, Facebook) with the empirical sensing capabilities of unmanned aerial vehicles (UAVs) for providing multifaceted information acquisition and situation awareness services in disaster recovery applications. A crucial task in the aftermath of a disaster is to determine the veracity of the reported events alongside assessing their underlying urgency, which can help the appropriate parties in their disaster mitigation and recovery efforts. For example, identifying early on that a reported event claiming fatalities is false could help divert alleviation efforts to genuinely critical events. However, existing SAS schemes are limited to deducing only the veracity of the reports on social data platforms and are unable to infer the underlying urgency of the events. In this paper, we explore the opportunity to develop a spatiotemporal-aware event investigation framework for SAS that can jointly determine the veracity of reported events as well as infer their underlying urgency and deadlines. However, constructing such an integrated system introduces a few new technical challenges. The first challenge is handling the predominant data sparsity in the incoming social signals. The second challenge is optimizing the UAV deployment and event veracity estimation processes by scrutinizing the highly dynamic and latent correlations among event characteristics. The third challenge is carefully extracting and analyzing latent semantic features embedded in the social media data to infer the event urgency. To address the above challenges, we introduce the Spatiotemporal-aware Event Investigation for SAS (SEIS) framework that harnesses techniques from natural language processing (NLP), deep learning, and spatial–temporal correlation modeling for deducing the event veracity, urgency and their underlying deadlines. Experiments on a real-world disaster recovery dataset demonstrate that SEIS achieves better event veracity estimation, event urgency inference, and deadline hit rate compared to state-of-the-art baselines.