Hongzhi Wang
Harbin Institute of Technology | HIT · Department of Computer Science and Engineering

PhD

About

444

Publications

41,098

Reads

3,185

Citations

My research areas include big data management and analysis, databases, knowledge engineering, and data quality. I study data management and analysis along the dimensions of quantity, quality, and intelligence. I am working on the integration of big data and AI.

Skills and Expertise

Data Mining and Knowledge Discovery

Big Data

Large Scale Data Analysis

Automatic Data Processing

Query Processing

Algorithms

Graphs

Information Integration

Databases

December 2015 - present

Harbin Institute of Technology

Department of Computer Science and Engineering
China

Position

Professor (Full)

October 2010 - December 2015

Harbin Institute of Technology

School of Computer Science and Technology
Harbin, China

Position

Professor (Associate)

September 1997 - July 2008

Harbin Institute of Technology

Field of study

Computer Science

Publications

EMIT: Micro-Invasive Database Configuration Tuning

Preprint

Jun 2024

The process of database knob tuning has always been a challenging task. Recently, database knob tuning methods have emerged as a promising solution to mitigate these issues. However, these methods still face certain limitations. On one hand, when applying knob tuning algorithms to optimize databases in practice, it either requires frequent updates to...

TodyNet: Temporal Dynamic Graph Neural Network for Multivariate Time Series Classification

Article

Jun 2024

RDBlab: An Artificial Simulation System for RDBMSs

Chapter

May 2024

Automatic time series forecasting model design based on pruning

Article

May 2024

ANSWER: Automatic Index Selector for Knowledge Graphs

Chapter

Apr 2024

One Seed, Two Birds: A Unified Learned Structure for Exact and Approximate Counting

Article

Mar 2024

The modern database has many precise and approximate counting requirements. Nevertheless, a solitary multidimensional index or cardinality estimator is insufficient to cater to the escalating demands across all counting scenarios. Such approaches are constrained either by query selectivity or by the compromise between query accuracy and efficiency....

DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning

Article

Mar 2024

Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor necessitates a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in uti...

Schema Integration on Massive Data Sources

Chapter

Mar 2024

As the fundamental phase of collecting and analyzing data, data integration is used in many applications, such as data cleaning, bioinformatics and pattern recognition. In the big data era, one of the major problems of data integration is to obtain the global schema of data sources since the global schema could be hardly derived from massive data sour...

Approximate Query Processing Based on Approximate Materialized View

Chapter

Mar 2024

In the context of big data, the interactive analysis database system needs to answer aggregate queries within a reasonable response time. The proposed AQP++ framework can integrate data preprocessing and AQP. It connects existing AQP engine with data preprocessing method to complete the connection between them in the process of interaction analysis...

Cardinality estimation based on QDSPN for embedded databases under dynamic workload

Preprint

Full-text available

Jan 2024

Cardinality estimation has been a pivotal and enduring research focus within database query optimization. While significant advancements have been made in estimating cardinalities for both individual tables and complex multi-table joins, there remains a notable gap in research pertaining to embedded database scenarios. Embedded databases are typica...

A Shapelet-Based Framework for Unsupervised Multivariate Time Series Representation Learning

Article

Jan 2024

Recent studies have shown great promise in unsupervised representation learning (URL) for multivariate time series, because URL has the capability in learning generalizable representation for many downstream tasks without using inaccessible labels. However, existing approaches usually adopt the models originally designed for other domains (e.g., co...

Dirty Data Processing for Machine Learning

Book

Jan 2024

Multimodal Data Modeling Technology and Its Application for Cloud-edge-device Collaboration

Article

Jan 2024

AutoSR: Automatic Sequential Recommendation System Design

Article

Jan 2024

Sequential Recommendation (SR) systems have recently emerged as a powerful tool for suggesting the next item of interest to users. Despite their great success, the design of SR systems requires heavy manual work and domain knowledge. In this paper,...

Database-Integrated Machine Learning for Enhanced Performance

Conference Paper

Dec 2023

LSTM-based Flow Prediction

Conference Paper

Dec 2023

Automated Feature Interaction and Feature Representation Learning of Multi-field Categorical Data

Conference Paper

Dec 2023

A Novel Approximation Algorithm for Max-Covering Circle Problem

Chapter

Dec 2023

We study the efficient approximation algorithm for max-covering circle problem. Given a set of weighted points in the plane and a circle with specified size, max-covering circle problem is to find the proper place where the center of the circle is located so that the total weight of the points covered by the circle is maximized. Our core approach i...

Density-Based Clustering for Incomplete Data

Chapter

Nov 2023

In the real world, missing values exist in many data sets and cause data incompleteness. However, traditional missing value imputation methods are not suitable for density-based clustering and affect the accuracy of clustering results. To solve this problem, this chapter designs a novel density-based clustering model for incomplete data which execu...

Dirty Data Impacts on Regression Models

Chapter

Nov 2023

Due to the negative influence of dirty data on the accuracy of regression models, the relation between the data quality and model results is able to be used in the selection of proper regression models and dirty data repairing strategies. Motivated by this, we develop an evaluation framework to measure the dirty data impacts on regression models. B...

Feature Selection on Inconsistent Data

Chapter

Nov 2023

With the explosive growth of data size, inconsistent data appear more frequently. Due to inconsistent data detection and repairing in data preprocessing, feature selection approaches lack efficiency. To avoid this problem, we develop a novel feature selection method on inconsistent data which considers the inconsistency issues into the proce...

Cost-Sensitive Decision Tree Induction on Dirty Data

Chapter

Nov 2023

With the rapid growth of data in our society, dirty data are increasingly common. In the process of cost-sensitive decision tree induction, dirty data in training data sets have negative impacts on the selection of splitting attributes and division of decision tree nodes. Hence, dirty data cleaning is necessary before classification tasks. However, m...

Incomplete Data Classification with View-Based Decision Tree

Chapter

Nov 2023

Missing values bring negative influence in data analyses and decrease the accuracy of machine learning models. Since traditional classification methods are only able to be adopted on complete data sets, this chapter presents a generalized classification model for incomplete data in which existing classification models are easily embedded. We first...

Impacts of Dirty Data on Classification and Clustering Models

Chapter

Nov 2023

Since dirty data have a negative influence on the accuracy of machine learning models, the relation between data quality and model results could be used in the selection of the proper model and data cleaning strategies. However, little work has focused on this topic. Motivated by this, this chapter compares the impacts of missing, inconsistent, and con...

Ensemble feature selection with adaptive weights

Conference Paper

Oct 2023

Automated Graph Neural Network Search Under Federated Learning Framework

Article

Oct 2023

Graph Neural Network (GNN) has achieved great success in the field of graph data processing and analysis, but the design of GNN architecture is difficult and time-consuming. To reduce the development cost of GNNs, recently, some GNN Neur...

Auto-TSA: An Automatic Time Series Analysis System Based on Meta-learning

Chapter

Sep 2023

Time series is a necessary data type in both industrial scenarios and data analysis. In this era of explosive data growth, the significant development of sensors has made it possible to obtain massive amounts of time series data. However, the performance of different algorithms for different types of time series data varies greatly. So how to autom...

Correction to: Auto-TSA: An Automatic Time Series Analysis System Based on Meta-learning

Chapter

Sep 2023

Discovering time series motifs of all lengths using dynamic time warping

Article

Full-text available

Sep 2023

Motif discovery is a fundamental operation in the analysis of time series data. Existing motif discovery algorithms that support Dynamic Time Warping require manual determination of the exact length of motifs. However, setting appropriate length for interesting motifs can be challenging and selecting inappropriate motif lengths may result in valuab...

Complex Time Series Analysis Based on Conditional Random Fields

Chapter

Sep 2023

A fundamental problem with complex time series analysis involves data prediction and repair. However, existing methods are not accurate enough for complex and multidimensional time series data. In this paper, we propose a novel approach, a complex time series prediction model, which is based on the conditional random field (CRF) and recurrent neura...

Prediction of Time Series Data with Low Latitude Features

Chapter

Sep 2023

The main purpose of this paper is to study the key technology for the prediction of time series data. It has a very wide range of applications, such as forecasting sales. Forecasting sales can be said to play an important role in company operations. Whether for saving costs or inventory scheduling, accurate prediction can save unnecessary waste. Fr...

Dimension Reduction Based on Sampling

Chapter

Sep 2023

Dimension reduction provides a powerful means of reducing the number of random variables under consideration. However, there were many similar tuples in large datasets, and before reducing the dimension of the dataset, we removed some similar tuples to retain the main information of the dataset while accelerating the dimension reduction. Accordingl...

SAT: sampling acceleration tree for adaptive database repartition

Article

Full-text available

Aug 2023

Nowadays, the volume of online data stored on websites is constantly increasing, and users’ demand for faster query response times is also on the rise with the expansion of network bandwidth. To improve the efficiency of database query, many large enterprises use database partitioning to divide huge database tables and speed up query results. While...

Time Series Compression based on Reinforcement Learning

Article

Aug 2023

Nowadays, sensors and signal catchers in various fields are capturing time-series data all the time, and time-series data are exploding. Due to the large storage space requirements and redundancy, many compression techniques for time series have been proposed. However, the existing compression algorithms still face the challenge of the contradictio...

Search For Deep Graph Neural Networks

Article

Aug 2023

IWEK: An Interpretable What-If Estimator for Database Knobs

Preprint

Jul 2023

The knobs of modern database management systems have a significant impact on the performance of the systems. With the development of cloud databases, an estimation service for knobs is urgently needed to improve database performance. Unfortunately, little attention has been paid to estimating the performance of certain knob configurations. To fil...

Duet: efficient and scalable hybriD neUral rElation undersTanding

Preprint

Jul 2023

Cardinality estimation methods based on probability distribution estimation have achieved high-precision estimation results compared to traditional methods. However, the most advanced methods suffer from high estimation costs due to the sampling method they use when dealing with range queries. Also, such a sampling method makes them difficult to di...

SoftStep relaxation for mining optimal convolution kernel

Article

Jul 2023

Automatic single table storage structure selection for hybrid workload

Article

Full-text available

Jun 2023

In the use of database systems, the design of the storage engine and data model directly affects the performance of the database when performing queries. Therefore, the users of the database need to select the storage engine and design data model according to the workload encountered. However, in a hybrid workload, the query set of the database is...

Data quality model for assessing public COVID-19 big datasets

Article

Full-text available

May 2023

For decision-making support and evidence based on healthcare, high quality data are crucial, particularly if the emphasized knowledge is lacking. For public health practitioners and researchers, the reporting of COVID-19 data needs to be accurate and easily available. Each nation has a system in place for reporting COVID-19 data, albeit these system...

Contrastive Shapelet Learning for Unsupervised Multivariate Time Series Representation Learning

Preprint

Full-text available

May 2023

One stone, two birds: A lightweight multidimensional learned index with cardinality support

Preprint

Full-text available

May 2023

Innovative learning based structures have recently been proposed to tackle index and cardinality estimation tasks, specifically learned indexes and data driven cardinality estimators. These structures exhibit excellent performance in capturing data distribution, making them promising for integration into AI driven database kernels. However, accurat...

Analytical framework for end-to-end channel capacity in molecular communication system

Article

Full-text available

May 2023

The molecular communication system (MCS) is mainly based on the design structure of the nanodevices which are employed as nano-transmitter (Nano-TX) and nano-receiver (Nano-RX), owing to the limited drug-reservoir capacity. The current work addresses the physical design of such nanodevices and the coordination of molecular communication to accompli...

Multi-Source Data Repairing: A Comprehensive Survey

Article

Full-text available

May 2023

In the era of Big Data, integrating information from multiple sources has proven valuable in various fields. To ensure a high-quality supply of multi-source data, repairing different types of errors in the multi-source data becomes critical. This paper categorizes errors in multi-source data into entity information overlapping, attribute value conf...

SNN-AAD: Active Anomaly Detection Method for Multivariate Time Series with Sparse Neural Network

Chapter

Apr 2023

Anomaly detection of time series data is an important and popular problem in both research and application fields. Various kinds of solutions have been developed to uncover the anomaly instances from data. However, the labelled data is always limited and costly for real applications, which adds to the difficulty of identifying various anomalies in multivar...

Cleanits-MEDetect: Multiple Errors Detection for Time Series Data in Cleanits

Chapter

Apr 2023

Data quality problems are seriously prevalent in time series data, and the data suffer from types of errors including single-point errors, continuous errors, and contextual errors. Since it is challenging to achieve high accuracy and efficiency in error detection tasks for time series data, we develop error detection system MEDetect in Cleanits, a...

CnosDB: A Flexible Distributed Time-Series Database for Large-Scale Data

Chapter

Apr 2023

With the development of the Internet of Things, the time series data generated by monitors, analyzers, and detection instruments in the industry has surged. The management of very large-scale time series data faces great challenges. However, the current distributed time series database is still poor in terms of data storage efficiency and data writ...

PFKMaster: A Knowledge-Driven Flow Control System for Large-Scale Power Grid

Chapter

Apr 2023

Various stability analyses of the power system are based on the results of power flow calculation, which is not always convergent. In practice, large manual efforts are required to be repeated many times by electrical engineers to ensure the convergence of power flow calculation, which consumes much manpower and time cost. Motivated by this, we dev...

TodyNet: Temporal Dynamic Graph Neural Network for Multivariate Time Series Classification

Preprint

Full-text available

Apr 2023

Multivariate time series classification (MTSC) is an important data mining task, which can be effectively solved by popular deep learning technology. Unfortunately, the existing deep learning-based methods neglect the hidden dependencies in different dimensions and also rarely consider the unique dynamic features of time series, which lack sufficie...

Reachability Queries with Label and Substructure Constraints on Knowledge Graphs (Extended abstract)

Conference Paper

Apr 2023

TSC-AutoML: Meta-learning for Automatic Time Series Classification Algorithm Selection

Conference Paper

Apr 2023

UniTS: A Universal Time Series Analysis Framework with Self-supervised Representation Learning

Preprint

Full-text available

Mar 2023

Machine learning has emerged as a powerful tool for time series analysis. Existing methods are usually customized for different analysis tasks and face challenges in tackling practical problems such as partial labeling and domain shift. To achieve universal analysis and address the aforementioned problems, we develop UniTS, a novel framework that i...

Identifying Effective Trajectory Predictions Under the Guidance of Trajectory Anomaly Detection Model

Article

Mar 2023

DCLP: Neural Architecture Predictor with Curriculum Contrastive Learning

Preprint

Feb 2023

Neural predictors currently show great potential in the performance evaluation phase of neural architecture search (NAS). Despite their efficiency in the evaluation process, it is challenging to train the predictor with fewer architecture evaluations for efficient NAS. However, most of the current approaches are more concerned with improving the st...

TS-Cabinet: Hierarchical Storage for Cloud-Edge-End Time-series Database

Preprint

Feb 2023

Hierarchical data storage is crucial for cloud-edge-end time-series database. Efficient hierarchical storage will directly reduce the storage space of local databases at each side and improve the access hit rate of data. However, no effective hierarchical data management strategy for cloud-edge-end time-series database has been proposed. To solve t...

FedST: Federated Shapelet Transformation for Interpretable Time Series Classification

Preprint

Full-text available

Feb 2023

This paper studies how to develop accurate and interpretable time series classification (TSC) models with the help of external data in a privacy-preserving federated learning (FL) scenario. To the best of our knowledge, we are the first to study this essential topic. Achieving this goal requires us to seamlessly integrate the techniques from mul...

HyGGE: Hyperbolic Graph Attention Network for Reasoning over Knowledge Graphs

Article

Feb 2023

Ts-Cabinet: Hierarchical Storage for Cloud-Edge-End Time-Series Database

Article

Jan 2023

Search for Deep Graph Neural Networks

Article

Jan 2023

Time Series Compression Based on Reinforcement Learning

Preprint

Jan 2023

TransFusion Model Fusion Mechanism Based on Transformer for Traffic Flow Prediction

Article

Full-text available

Jan 2023

In recent years, the problem of traffic congestion has become a hot topic. Accurate traffic flow prediction methods have received extensive attention from many researchers all over the world. Although many methods proposed at present have achieved good results in the field of traffic flow prediction, most of them only consider the static characteri...

Efficient Skycube Computation on High Dimensional Data

Preprint

Jan 2023

CUBE: Causal Intervention-Based Counterfactual Explanation for Prediction Models

Article

Jan 2023

Recent years have witnessed the rapid explosion of artificial intelligence applied in various domains, with performance surpassing the human level. Despite the success, these models' underlying mechanisms remain a mystery, as their complicated representations make human understanding impossible. This mystery may cause discrimination and non-...

Auto-STGCN: Autonomous Spatial-Temporal Graph Convolutional Network Search

Article

Dec 2022

In recent years, many spatial-temporal graph convolutional network (STGCN) models have been proposed to deal with the spatial-temporal network data forecasting problem. These STGCN models have their own advantages, i.e., each of them puts forward many effective operations and achieves good prediction results in real applications. If users can effecti...

EFFECT: Explainable Framework for Meta-learning in Automatic Classification Algorithm Selection

Article

Dec 2022

With the growing convergence of artificial intelligence and daily life scenarios, the application scenarios for intelligent decision methods are becoming increasingly complex. The development of various machine learning algorithms has benefited all disciplines of study, but choosing which algorithm is most suitable for a certain problem among a lar...

AAE: An Active Auto-Estimator for Improving Graph Storage

Article

Dec 2022

IoT data cleaning techniques: A survey

Article

Full-text available

Dec 2022

Data cleaning is considered as an effective approach of improving data quality in order to help practitioners and researchers be devoted to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time...

Parallel Skyline Query Processing of Massive Incomplete Activity-Trajectories Data

Chapter

Nov 2022

The big spatial temporal data captured from technology tools produce massive amount of trajectories data collected from GPS devices. The top-k query was proposed by many researchers, on which they used distance and text parameters for processing. However, the information related to text parameter like activity is always not presented due to some re...

KELM: Knowledge-Enhanced Learning Methodology for Cardinality Estimation

Preprint

Full-text available

Nov 2022

In recent years, the database community has attempted to develop learned cardinality estimators to improve estimation accuracy. Although some research shows that applying deep learning to cardinality estimation is a significant and promising direction, there still exist many problems in implementing these techniques in real applications (l...

EEML: Ensemble Embedded Meta-Learning

Chapter

Nov 2022

To accelerate the learning process with few samples, meta-learning resorts to prior knowledge from previous tasks. However, inconsistent task distributions and heterogeneity are hard to handle through a globally shared model initialization. In this paper, based on gradient-based meta-learning, we propose an ensemble embedded meta-learning algorith...

A linear algorithm for semi-external cutnode computation

Article

Nov 2022

In the literature, many algorithms have been proposed for finding cutnodes on undirected graphs, since cutnodes are crucial to graph connectivity. Here, a cutnode of an undirected graph G is a node of G, whose deletion will cause a reachable pair of the other nodes in G to be unreachable. Currently, the difficulty of maintaining the entire G in the...
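
The cutnode definition quoted above can be illustrated with a small in-memory sketch. This is a brute-force check of the definition only, assuming the whole graph fits in memory as an adjacency dictionary; it is not the semi-external linear algorithm the article proposes, and the names `count_components`, `cutnodes`, and the example graph are hypothetical.

```python
from collections import deque


def count_components(adj, removed=None):
    """Count connected components of an undirected graph, skipping `removed`."""
    seen = set()
    if removed is not None:
        seen.add(removed)  # treat the removed node as already visited
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
    return components


def cutnodes(adj):
    """Return nodes whose deletion disconnects a previously reachable pair."""
    base = count_components(adj)
    # Deleting a cutnode splits its component, so the count strictly grows.
    return sorted(v for v in adj if count_components(adj, removed=v) > base)


# Hypothetical example: path a-b-c attached to triangle c-d-e.
example = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
print(cutnodes(example))
```

Each node is tested with one traversal, so the sketch costs O(V·(V+E)) time; it is meant only to make the definition concrete.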

Differentiable Self-Adaptive Learning Rate

Preprint

Oct 2022

Learning rate adaptation is a popular topic in machine learning. Gradient Descent trains a neural network with a fixed learning rate. Learning rate adaptation is proposed to accelerate the training process through adjusting the step size in the training session. Famous works include Momentum, Adam and Hypergradient. Hypergradient is the most special...

Identifying Effective Trajectory Predictions Under the Guidance of Trajectory Anomaly Detection Model

Preprint

Full-text available

Oct 2022

Trajectory Prediction (TP) is an important research topic in computer vision and robotics fields. Recently, many stochastic TP models have been proposed to deal with this problem and have achieved better performance than the traditional models with deterministic trajectory outputs. However, these stochastic models can generate a number of future t...

A Generative Adversarial Active Learning Method for Effective Outlier Detection

Conference Paper

Oct 2022

A meta learning approach for open information extraction

Article

Full-text available

Aug 2022

As one of the most important research topics in the field of natural language processing, open information extraction has achieved gratifying research findings in recent years. Even if so much effort is put into the work of open information extraction, there are still many shortcomings and great room for improvement in the existing system. The trad...

FedTSC: a secure federated learning system for interpretable time series classification

Article

Full-text available

Aug 2022

We demonstrate FedTSC, a novel federated learning (FL) system for interpretable time series classification (TSC). FedTSC is an FL-based TSC solution that makes a great balance among security, interpretability, accuracy, and efficiency. We achieve this by first extending the concept of FL to consider both stronger security and model interpretability...

Time Series Data Quality Enhancing Based on Pattern Alignment

Chapter

Jul 2022

We have witnessed the rapid evolution of data intelligence benefiting the decision making of complex multi-equipment systems. Collected by sensors on the equipment temporally, such data indicates the opportunity of real-time analysis and workflow optimization, while bringing data quality challenges to the specialists. The usage of low quality data...

METransE: Manifold-like mechanism enhanced embedding for reasoning over knowledge graphs

Article

Jul 2022

The knowledge graph (KG) has attracted much attention due to its positive effect on AI-related applications. However, even widely used large-scale KGs are still far from being complete and comprehensive. This prompts reasoning over KGs to be one of the most basic and attention-grabbing tasks. However, most existing reasoning methods only...

AAE: An Active Auto-Estimator for Improving Graph Storage

Preprint

Jun 2022

Nowadays, graphs have become an increasingly popular model in many real applications. The efficiency of graph storage is crucial for these applications. Generally speaking, the tuning tasks of graph storage rely on the database administrators (DBAs) to find the best graph storage. However, DBAs make the tuning decisions by mainly relying on their experience...

EEML: Ensemble Embedded Meta-learning

Preprint

Jun 2022

Pattern Discovery for Heterogeneous Data

Chapter

Jun 2022

In the field of knowledge discovery for multi-source homogeneous data, for an entity, its correct value is found by resolving conflicts among multiple sources of information. However, due to missing values and inefficient entity matching, a single entity’s information is often insufficient in practical applications. This phenomenon requires pattern...

Functional-Dependency-Based Truth Discovery for Isomorphic Data

Chapter

Jun 2022

It is unavoidable that errors occur in databases. Reasons include recording errors, stale data, and even intentional errors. Such mistakes may cause serious consequences. It is impossible to correct those errors manually at scale. In fact, it is hard for people to even detect errors. However, since errors often occur rather randomly, they may cause...

Fact Discovery for Text Data

Chapter

Jun 2022

Fact extraction, which aims to extract (entity, attribute, value)-tuples from massive text corpora, is crucial in text data mining. Recent approaches focus on extracting facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated based on content-based criteria, such as frequency. However, these approaches over...

Denial-Constraint-Based Truth Discovery for Isomorphic Data

Chapter

Jun 2022

Aggregating accurate information from multi-source conflicting data is crucial. A common approach to address this problem is Voting/Averaging. However, such methods usually fail to achieve correct results, since they assume that all the sources are equally reliable. In most cases, the information quality usually varies a lot among diversified sourc...

ATPFL: Automatic Trajectory Prediction Model Design under Federated Learning Framework

Conference Paper

Full-text available

Jun 2022

Although the Trajectory Prediction (TP) model has achieved great success in computer vision and robotics fields, its architecture and training scheme design rely on heavy manual work and domain knowledge, which is not friendly to common users. Besides, the existing works ignore Federated Learning (FL) scenarios, failing to make full use of distribut...

FIND: Explainable Framework for Meta-learning

Preprint

May 2022

Meta-learning is used to efficiently enable the automatic selection of machine learning models by combining data and prior knowledge. Since traditional meta-learning techniques lack explainability and have shortcomings in transparency and fairness, achieving explainability for meta-learning is crucial. This paper proposes FIND, an i...

JointMatcher: Numerically-aware entity matching using pre-trained language models with attention concentration

Article

May 2022

Entity matching (EM) aims to identify whether two records refer to the same underlying real-world entity. Traditional entity matching methods mainly focus on structured data, where the attribute values are short and atomic. Recently, there has been an increasing demand for matching textual records, such as matching descriptions of products that cor...

Partial multi-label learning via specific label disambiguation

Article

May 2022

Partial Multi-Label Learning (PML) aims to learn a robust multi-label classifier from training data in which each instance is associated with a set of candidate labels, only a subset of which is relevant. Some existing methods consider the noise in the feature space and have made some progress. However, they ignore that each label mig...

A Dual-Store Structure for Knowledge Graphs (Extended Abstract)

Conference Paper

May 2022

Automatic Scheduling Technology of Computing Power Network Driven by Knowledge Graph

Conference Paper

May 2022

CO-AutoML: An Optimizable Automated Machine Learning System

Chapter

Apr 2022

In recent years, many automated machine learning (AutoML) techniques have been proposed for the automatic selection or design of machine learning models. They bring great convenience to the use of machine learning techniques, but are difficult for users without programming experience to use, and lack an effective optimization scheme to respond to users’ di...

AutoTS: Automatic Time Series Forecasting Model Design Based on Two-Stage Pruning

Preprint

Mar 2022

Automatic Time Series Forecasting (TSF) model design, which aims to help users efficiently design a suitable forecasting model for a given time series data scenario, is a novel research topic that urgently needs to be addressed. In this paper, we propose the AutoTS algorithm, which tries to utilize the existing design skills and design efficient search methods to effec...

Figure previews: Overview of Performance Predictor for Knowledge Graph Stores; An Instance of Predicate Connected Graph

PreKar: A learned performance predictor for knowledge graph stores

Article

Mar 2022

Effective knowledge graph storage management is identified as the basic premise to make full use of knowledge graphs. Due to the lack of performance evaluation for knowledge graph stores, it is difficult for users to decide which one is the best. However, none of the existing studies of performance prediction focuses on storage structures. To fill this...

Figure previews: The actions of dividing and merging tables; Reinforcement learning network structure; Query time changes with the number of training episodes

GSBRL: Efficient RDF graph storage based on reinforcement learning

Article

Mar 2022

Knowledge, often represented as RDF graphs, is the cornerstone of artificial intelligence. The large-scale RDF graphs in various fields pose new challenges to graph data management. Due to their maturity and stability, relational databases are a good choice for RDF graph storage. However, the management of the complex structure of RDF graphs in...

Figure previews: Applicability evaluation for function-distribution pairs. Cases where...; Applicability evaluation for averages of distribution pairs. Cases...; Efficiency evaluation for algorithms that work under the...; Efficiency evaluation for algorithms providing ordering guarantee

MISS: finding optimal sample sizes for approximate analytics

Article

Mar 2022

Nowadays, sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query regarding given error constraints in general, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally,...
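
As a rough illustration of what a sample size under an error constraint means, the sketch below uses the textbook normal-approximation bound for estimating a mean (AVG). This is a simple special case, not the MISS approach, and it assumes the population standard deviation is known, which is rarely true in practice. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist


def clt_sample_size(sigma, error, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= error (normal approximation).

    sigma: assumed-known population standard deviation
    error: maximum tolerated absolute error of the estimated mean
    confidence: probability that the error bound holds
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / error) ** 2)


n_loose = clt_sample_size(sigma=10.0, error=1.0)
n_tight = clt_sample_size(sigma=10.0, error=0.5)
```

Halving the error bound roughly quadruples the required sample size, which is why choosing the sample size carefully matters for interactivity.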

Evaluating Community Quality Based on Ground-truth

Article

Mar 2022

An effective Community Scoring Function (CSF) is very important since it can properly quantify the community quality of node groups and help us effectively discover valuable network communities. Currently, researchers have proposed many types of CSFs. However, none of them were based on an experimental and theoretical analysis of the node...

Efficient Semi-External Depth-First Search

Article

Mar 2022

As graphs grow in size, many real-world graphs are difficult to load into the primary memory of a computer. Thus, computing depth-first search (DFS) results (i.e., depth-first order or DFS-Tree) on the semi-external memory model is important to investigate. Semi-external algorithms assume that the primary memory can at least hold a spanning tree T...

Figure previews: Fig. 1: The fragment of a knowledge graph; Initial HIT@1 results and enhanced results

Dynamic Relation Repairing for Knowledge Enhancement

Preprint

Feb 2022

Dynamic relation repair aims to efficiently validate and repair the instances for knowledge graph enhancement (KGE), where KGE captures missing relations from unstructured data and introduces noisy facts into the knowledge graph. With the prosperity of unstructured data, an online approach is needed to clean the new RDF tuples before adding them to the...

Figure preview: Similarity Analysis Based on Time Windows

Predict industrial equipment failure with time windows and transfer learning

Article

Feb 2022

Sensors, increasingly deployed in industry, generate large volumes of high-dimensional unlabeled time series data during complex production processes. If these data are put to use, we can predict and prevent malfunctions of specific industrial facilities and thereby reduce financial losses. In this paper, we propose...

Auto-CASH: A meta-learning embedding approach for autonomous classification algorithm selection

Article

Jan 2022

With years of development, machine learning algorithms have achieved excellent performance in some data analysis and data mining tasks. To apply machine learning to new tasks, techniques for selecting a suitable algorithm and hyperparameters, a task known as the Combined Algorithm Selection and Hyperparameter optimization problem, are in demand. In the field of...