Hongzhi Wang

Harbin Institute of Technology | HIT · Department of Computer Science and Engineering

PhD

About

444
Publications
41,098
Reads
3,185
Citations
Introduction
My research areas include big data management and analysis, databases, knowledge engineering, and data quality. I study data management and analysis along the dimensions of quantity, quality, and intelligence. I am working on the integration of big data and AI.
Additional affiliations
December 2015 - present
Harbin Institute of Technology
Position
  • Professor (Full)
October 2010 - December 2015
Harbin Institute of Technology
Position
  • Professor (Associate)
Education
September 1997 - July 2008
Harbin Institute of Technology
Field of study
  • Computer Science

Publications (444)
Preprint
The process of database knob tuning has always been a challenging task. Recently, database knob tuning methods have emerged as a promising solution to mitigate these issues. However, these methods still face certain limitations. On one hand, when applying knob tuning algorithms to optimize databases in practice, it either requires frequent updates to...
Article
The modern database has many precise and approximate counting requirements. Nevertheless, a solitary multidimensional index or cardinality estimator is insufficient to cater to the escalating demands across all counting scenarios. Such approaches are constrained either by query selectivity or by the compromise between query accuracy and efficiency....
Article
Neural predictors have shown great potential in the evaluation process of neural architecture search (NAS). However, current predictor-based approaches overlook the fact that training a predictor necessitates a considerable number of trained neural networks as the labeled training set, which is costly to obtain. Therefore, the critical issue in uti...
Chapter
As the fundamental phase of collecting and analyzing data, data integration is used in many applications, such as data cleaning, bioinformatics and pattern recognition. In the big data era, one of the major problems of data integration is to obtain the global schema of data sources since the global schema can hardly be derived from massive data sour...
Chapter
In the context of big data, the interactive analysis database system needs to answer aggregate queries within a reasonable response time. The proposed AQP++ framework can integrate data preprocessing and AQP. It connects existing AQP engine with data preprocessing method to complete the connection between them in the process of interaction analysis...
Preprint
Full-text available
Cardinality estimation has been a pivotal and enduring research focus within database query optimization. While significant advancements have been made in estimating cardinalities for both individual tables and complex multi-table joins, there remains a notable gap in research pertaining to embedded database scenarios. Embedded databases are typica...
Article
Recent studies have shown great promise in unsupervised representation learning (URL) for multivariate time series, because URL is capable of learning generalizable representations for many downstream tasks without using inaccessible labels. However, existing approaches usually adopt the models originally designed for other domains (e.g., co...
Article
Sequential Recommendation (SR) systems have emerged recently as a powerful tool for suggesting the next item of interest to users. Despite their great success, the design of SR systems requires heavy manual work and domain knowledge. In this paper,...
Chapter
We study the efficient approximation algorithm for max-covering circle problem. Given a set of weighted points in the plane and a circle with specified size, max-covering circle problem is to find the proper place where the center of the circle is located so that the total weight of the points covered by the circle is maximized. Our core approach i...
Chapter
In the real world, missing values exist in many data sets and cause data incompleteness. However, traditional missing value imputation methods are not suitable for density-based clustering and affect the accuracy of clustering results. To solve this problem, this chapter designs a novel density-based clustering model for incomplete data which execu...
Chapter
Due to the negative influence of dirty data on the accuracy of regression models, the relation between data quality and model results can be used in the selection of proper regression models and dirty data repairing strategies. Motivated by this, we develop an evaluation framework to measure the dirty data impacts on regression models. B...
Chapter
With the explosive growth of data size, inconsistent data appear more frequently. Due to inconsistent data detection and repairing in data preprocessing, feature selection approaches lack efficiency. To avoid this problem, we develop a novel feature selection method on inconsistent data which considers the inconsistency issues into the proce...
Chapter
With the rapid growth of data in our society, dirty data are increasingly common. In the process of cost-sensitive decision tree induction, dirty data in training data sets have negative impacts on the selection of splitting attributes and division of decision tree nodes. Hence, dirty data cleaning is necessary before classification tasks. However, m...
Chapter
Missing values have a negative influence on data analyses and decrease the accuracy of machine learning models. Since traditional classification methods can only be applied to complete data sets, this chapter presents a generalized classification model for incomplete data in which existing classification models are easily embedded. We first...
Chapter
Since dirty data have a negative influence on the accuracy of machine learning models, the relation between data quality and model results could be used in the selection of the proper model and data cleaning strategies. However, little work has focused on this topic. Motivated by this, this chapter compares the impacts of missing, inconsistent, and con...
Article
Graph Neural Network (GNN) has achieved great success in the field of graph data processing and analysis, but the design of GNN architecture is difficult and time-consuming. To reduce the development cost of GNNs, recently, some GNN Neur...
Chapter
Time series is a necessary data type in both industrial scenarios and data analysis. In this era of explosive data growth, the significant development of sensors has made it possible to obtain massive amounts of time series data. However, the performance of different algorithms for different types of time series data varies greatly. So how to autom...
Article
Full-text available
Motif discovery is a fundamental operation in the analysis of time series data. Existing motif discovery algorithms that support Dynamic Time Warping require manual determination of the exact length of motifs. However, setting appropriate length for interesting motifs can be challenging and selecting inappropriate motif lengths may result in valuab...
Chapter
A fundamental problem with complex time series analysis involves data prediction and repair. However, existing methods are not accurate enough for complex and multidimensional time series data. In this paper, we propose a novel approach, a complex time series prediction model, which is based on the conditional random field (CRF) and recurrent neura...
Chapter
The main purpose of this paper is to study the key technology for the prediction of time series data. It has a very wide range of applications, such as forecasting sales. Forecasting sales can be said to play an important role in company operations. Whether for saving costs or inventory scheduling, accurate prediction can save unnecessary waste. Fr...
Chapter
Dimension reduction provides a powerful means of reducing the number of random variables under consideration. However, there were many similar tuples in large datasets, and before reducing the dimension of the dataset, we removed some similar tuples to retain the main information of the dataset while accelerating the dimension reduction. Accordingl...
Article
Full-text available
Nowadays, the volume of online data stored on websites is constantly increasing, and users’ demand for faster query response times is also on the rise with the expansion of network bandwidth. To improve the efficiency of database query, many large enterprises use database partitioning to divide huge database tables and speed up query results. While...
Article
Nowadays, sensors and signal catchers in various fields are capturing time-series data all the time, and time-series data are exploding. Due to the large storage space requirements and redundancy, many compression techniques for time series have been proposed. However, the existing compression algorithms still face the challenge of the contradictio...
Preprint
The knobs of modern database management systems have a significant impact on the performance of the systems. With the development of cloud databases, an estimation service for knobs is urgently needed to improve the performance of databases. Unfortunately, little attention has been paid to estimating the performance of certain knob configurations. To fil...
Preprint
Cardinality estimation methods based on probability distribution estimation have achieved high-precision estimation results compared to traditional methods. However, the most advanced methods suffer from high estimation costs due to the sampling method they use when dealing with range queries. Also, such a sampling method makes them difficult to di...
Article
Full-text available
In the use of database systems, the design of the storage engine and data model directly affects the performance of the database when performing queries. Therefore, the users of the database need to select the storage engine and design data model according to the workload encountered. However, in a hybrid workload, the query set of the database is...
Article
Full-text available
For decision-making support and evidence based on healthcare, high-quality data are crucial, particularly if the emphasized knowledge is lacking. For public health practitioners and researchers, the reporting of COVID-19 data needs to be accurate and easily available. Each nation has a system in place for reporting COVID-19 data, albeit these system...
Preprint
Full-text available
Recent studies have shown great promise in unsupervised representation learning (URL) for multivariate time series, because URL is capable of learning generalizable representations for many downstream tasks without using inaccessible labels. However, existing approaches usually adopt the models originally designed for other domains (e.g., co...
Preprint
Full-text available
Innovative learning based structures have recently been proposed to tackle index and cardinality estimation tasks, specifically learned indexes and data driven cardinality estimators. These structures exhibit excellent performance in capturing data distribution, making them promising for integration into AI driven database kernels. However, accurat...
Article
Full-text available
The molecular communication system (MCS) is mainly based on the design structure of the nanodevices which are employed as nano-transmitter (Nano-TX) and nano-receiver (Nano-RX), owing to the limited drug-reservoir capacity. The current work addresses the physical design of such nanodevices and the coordination of molecular communication to accompli...
Article
Full-text available
In the era of Big Data, integrating information from multiple sources has proven valuable in various fields. To ensure a high-quality supply of multi-source data, repairing different types of errors in the multi-source data becomes critical. This paper categorizes errors in multi-source data into entity information overlapping, attribute value conf...
Chapter
Anomaly detection of time series data is an important and popular problem in both research and application fields. Various solutions have been developed to uncover anomalous instances in data. However, labelled data are always limited and costly in real applications, which adds to the difficulty of identifying various anomalies in multivar...
Chapter
Data quality problems are seriously prevalent in time series data, and the data suffer from types of errors including single-point errors, continuous errors, and contextual errors. Since it is challenging to achieve high accuracy and efficiency in error detection tasks for time series data, we develop error detection system MEDetect in Cleanits, a...
Chapter
With the development of the Internet of Things, the time series data generated by monitors, analyzers, and detection instruments in the industry has surged. The management of very large-scale time series data faces great challenges. However, the current distributed time series database is still poor in terms of data storage efficiency and data writ...
Chapter
Various stability analyses of the power system are based on the results of power flow calculation, which is not always convergent. In practice, large manual efforts are required to be repeated many times by electrical engineers to ensure the convergence of power flow calculation, which consumes much manpower and time cost. Motivated by this, we dev...
Preprint
Full-text available
Multivariate time series classification (MTSC) is an important data mining task, which can be effectively solved by popular deep learning technology. Unfortunately, the existing deep learning-based methods neglect the hidden dependencies in different dimensions and also rarely consider the unique dynamic features of time series, which lack sufficie...
Preprint
Full-text available
Machine learning has emerged as a powerful tool for time series analysis. Existing methods are usually customized for different analysis tasks and face challenges in tackling practical problems such as partial labeling and domain shift. To achieve universal analysis and address the aforementioned problems, we develop UniTS, a novel framework that i...
Preprint
Neural predictors currently show great potential in the performance evaluation phase of neural architecture search (NAS). Despite their efficiency in the evaluation process, it is challenging to train the predictor with fewer architecture evaluations for efficient NAS. However, most of the current approaches are more concerned with improving the st...
Preprint
Hierarchical data storage is crucial for cloud-edge-end time-series database. Efficient hierarchical storage will directly reduce the storage space of local databases at each side and improve the access hit rate of data. However, no effective hierarchical data management strategy for cloud-edge-end time-series database has been proposed. To solve t...
Preprint
Full-text available
This paper studies how to develop accurate and interpretable time series classification (TSC) models with the help of external data in a privacy-preserving federated learning (FL) scenario. To the best of our knowledge, we are the first to study this essential topic. Achieving this goal requires us to seamlessly integrate the techniques from mul...
Article
Full-text available
In recent years, the problem of traffic congestion has become a hot topic. Accurate traffic flow prediction methods have received extensive attention from many researchers all over the world. Although many methods proposed at present have achieved good results in the field of traffic flow prediction, most of them only consider the static characteri...
Article
Recent years have witnessed the rapid growth of artificial intelligence applied in various domains, with performance surpassing the human level. Despite the success, these models' underlying mechanisms remain a mystery, as their complicated representations make human understanding impossible. This mystery may cause discrimination and non-...
Article
In recent years, many spatial-temporal graph convolutional network (STGCN) models have been proposed to deal with the spatial-temporal network data forecasting problem. These STGCN models have their own advantages, i.e., each of them puts forward many effective operations and achieves good prediction results in real applications. If users can effecti...
Article
With the growing convergence of artificial intelligence and daily life scenarios, the application scenarios for intelligent decision methods are becoming increasingly complex. The development of various machine learning algorithms has benefited all disciplines of study, but choosing which algorithm is most suitable for a certain problem among a lar...
Article
Full-text available
Data cleaning is considered an effective approach to improving data quality, helping practitioners and researchers devote themselves to downstream analysis and decision-making without worrying about data trustworthiness. This paper provides a systematic summary of the two main stages of data cleaning for Internet of Things (IoT) data with time...
Chapter
The big spatial-temporal data captured from technology tools produce massive amounts of trajectory data collected from GPS devices. The top-k query was proposed by many researchers, on which they used distance and text parameters for processing. However, the information related to text parameter like activity is always not presented due to some re...
Preprint
Full-text available
In recent years, the database community has attempted to develop learned cardinality estimators to improve estimation accuracy. Although some research shows that applying deep learning to cardinality estimation is a significant and promising direction, there still exist many problems in applying these techniques to real applications (l...
Chapter
To accelerate the learning process with few samples, meta-learning resorts to prior knowledge from previous tasks. However, inconsistent task distribution and heterogeneity are hard to handle through a globally shared model initialization. In this paper, based on gradient-based meta-learning, we propose an ensemble embedded meta-learning algorith...
Article
In the literature, many algorithms have been proposed for finding cutnodes on undirected graphs, since cutnodes are crucial to graph connectivity. Here, a cutnode of an undirected graph G is a node of G, whose deletion will cause a reachable pair of the other nodes in G to be unreachable. Currently, the difficulty of maintaining the entire G in the...
Preprint
Learning rate adaptation is a popular topic in machine learning. Gradient Descent trains neural networks with a fixed learning rate. Learning rate adaptation is proposed to accelerate the training process by adjusting the step size during the training session. Famous works include Momentum, Adam and Hypergradient. Hypergradient is the most special...
Preprint
Full-text available
Trajectory Prediction (TP) is an important research topic in computer vision and robotics fields. Recently, many stochastic TP models have been proposed to deal with this problem and have achieved better performance than the traditional models with deterministic trajectory outputs. However, these stochastic models can generate a number of future t...
Article
Full-text available
As one of the most important research topics in the field of natural language processing, open information extraction has achieved gratifying research findings in recent years. Even if so much effort is put into the work of open information extraction, there are still many shortcomings and great room for improvement in the existing system. The trad...
Article
Full-text available
We demonstrate FedTSC, a novel federated learning (FL) system for interpretable time series classification (TSC). FedTSC is an FL-based TSC solution that makes a great balance among security, interpretability, accuracy, and efficiency. We achieve this by first extending the concept of FL to consider both stronger security and model interpretability...
Chapter
We have witnessed the rapid evolution of data intelligence benefiting the decision making of complex multi-equipment systems. Collected by sensors on the equipment temporally, such data indicates the opportunity of real-time analysis and workflow optimization, while bringing data quality challenges to the specialists. The usage of low quality data...
Article
The knowledge graph (KG) has attracted much attention due to its positive effect on AI-related applications. However, even widely used large-scale KGs are still far from being complete and comprehensive. This makes reasoning over KGs one of the most basic and attention-grabbing tasks. However, most existing reasoning methods only...
Preprint
Nowadays, graphs are an increasingly popular model in many real applications. The efficiency of graph storage is crucial for these applications. Generally speaking, the tuning tasks of graph storage rely on database administrators (DBAs) to find the best graph storage. However, DBAs make tuning decisions by mainly relying on their experience...
Preprint
To accelerate the learning process with few samples, meta-learning resorts to prior knowledge from previous tasks. However, inconsistent task distribution and heterogeneity are hard to handle through a globally shared model initialization. In this paper, based on gradient-based meta-learning, we propose an ensemble embedded meta-learning algorith...
Chapter
In the field of knowledge discovery for multi-source homogeneous data, for an entity, its correct value is found by resolving conflicts among multiple sources of information. However, due to missing values and inefficient entity matching, a single entity’s information is often insufficient in practical applications. This phenomenon requires pattern...
Chapter
It is unavoidable that errors occur in databases. Reasons include recording errors, stale data, and even intentional errors. Such mistakes may cause serious consequences. It is impossible to correct those errors manually at scale. In fact, it is hard for people to even detect errors. However, since errors often occur rather randomly, they may cause...
Chapter
Fact extraction, which aims to extract (entity, attribute, value)-tuples from massive text corpora, is crucial in text data mining. Recent approaches focus on extracting facts by mining textual patterns with semantic types, where the quality of a pattern is evaluated based on content-based criteria, such as frequency. However, these approaches over...
Chapter
Aggregating accurate information from multi-source conflicting data is crucial. A common approach to address this problem is Voting/Averaging. However, such methods usually fail to achieve correct results, since they assume that all the sources are equally reliable. In most cases, the information quality usually varies a lot among diversified sourc...
Conference Paper
Full-text available
Although the Trajectory Prediction (TP) model has achieved great success in computer vision and robotics fields, its architecture and training scheme design rely on heavy manual work and domain knowledge, which is not friendly to common users. Besides, the existing works ignore Federated Learning (FL) scenarios, failing to make full use of distribut...
Preprint
Meta-learning is used to efficiently enable the automatic selection of machine learning models by combining data and prior knowledge. Since the traditional meta-learning technique lacks explainability, as well as shortcomings in terms of transparency and fairness, achieving explainability for meta-learning is crucial. This paper proposes FIND, an i...
Article
Entity matching (EM) aims to identify whether two records refer to the same underlying real-world entity. Traditional entity matching methods mainly focus on structured data, where the attribute values are short and atomic. Recently, there has been an increasing demand for matching textual records, such as matching descriptions of products that cor...
Article
Partial Multi-Label Learning (PML) aims to learn a robust multi-label classifier from training data, where each instance is associated with a set of candidate labels, among which only a subset of them is relevant. Some existing methods consider the noise in the feature space and have made some achievements. However, they ignored that each label mig...
Chapter
Full-text available
In recent years, many automated machine learning (AutoML) techniques have been proposed for the automatic selection or design of machine learning models. They bring great convenience to the use of machine learning techniques, but are difficult for users without programming experience to use, and lack an effective optimization scheme to respond to users’ di...
Preprint
Automatic Time Series Forecasting (TSF) model design, which aims to help users efficiently design suitable forecasting models for given time series data scenarios, is a novel research topic that urgently needs to be addressed. In this paper, we propose the AutoTS algorithm, which tries to utilize existing design skills and design efficient search methods to effec...
Article
Full-text available
Effective knowledge graph storage management is identified as the basic premise to make full use of knowledge graphs. Due to the lack of performance evaluation for knowledge graph stores, it is difficult for users to decide which one is the best. However, none of existing studies of performance prediction focuses on storage structures. To fill this...
Article
Full-text available
Knowledge is the cornerstone of artificial intelligence, which is often represented as RDF graphs. The large-scale RDF graphs in various fields pose new challenges to graph data management. Due to the maturity and stability, relational database is a good choice for RDF graph storage. However, the management of the complex structure of RDF graphs in...
Article
Full-text available
Nowadays, sampling-based Approximate Query Processing (AQP) is widely regarded as a promising way to achieve interactivity in big data analytics. To build such an AQP system, finding the minimal sample size for a query regarding given error constraints in general, called Sample Size Optimization (SSO), is an essential yet unsolved problem. Ideally,...
Article
An effective Community Scoring Function (CSF) is very important since it can properly quantify the community quality of node groups and help us effectively discover valuable network communities. Currently, researchers have proposed many types of CSFs. However, none of them were based on an experimental and theoretical analysis of the node...
Article
As graphs grow in size, many real-world graphs are difficult to load into the primary memory of a computer. Thus, computing depth-first search (DFS) results (i.e., depth-first order or DFS-Tree) on the semi-external memory model is important to investigate. Semi-external algorithms assume that the primary memory can at least hold a spanning tree T...
Preprint
Full-text available
Dynamic relation repair aims to efficiently validate and repair the instances for knowledge graph enhancement (KGE), where KGE captures missing relations from unstructured data and leads to noisy facts to the knowledge graph. With the prosperity of unstructured data, an online approach is asked to clean the new RDF tuples before adding them to the...
Article
Full-text available
Sensors, increasingly deployed in industry, have generated a large amount of high-dimensional unlabeled time series data during complicated production processes. If these data are put to use, we can predict and prevent malfunctions of specific industrial facilities so that there will be less pecuniary loss. In this paper, we propose...
Article
With years of development, machine learning algorithms have achieved excellent performance in some data analysis and data mining tasks. To apply machine learning to new tasks, techniques for selecting a suitable algorithm and hyperparameters, a problem known as Combined Algorithm Selection and Hyperparameter optimization, are in demand. In the field of...