Input and output trees during query optimization. a Logical input tree, b non-optimal physical tree, c optimal physical tree

Source publication

SCOPE: parallel databases meet MapReduce

Article

Full-text available

Oct 2012

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opp...

Context 1

... TO "S.ss" CLUSTERED BY x SORTED BY x,y; Figure 4a shows a tree of logical operators that specifies, in an almost one-to-one correspondence, the relational alge- bra representation of the script above. Figure 4b, c show two different physical operator trees corresponding to alternative execution plans for the same script. ...

View in full-text

Context 2

View in full-text

Context 3

... illustrate the importance of script optimization by con- trasting Fig. 4b, c. In the straightforward execution plan in Fig. 4b, table R is first hash-partitioned on columns {R.x, R.y}. This step is done in parallel, with each machine processing its portion of the table. Similarly, table S is par- titioned on {S.x, S.y}. Next, rows from matching R and S partitions are hash-joined in parallel, producing a ...

View in full-text

Context 4

View in full-text

Context 5

... data, data reshuffling through the network can pose a seri- ous performance bottleneck. The alternative plan in Fig. 4c shows a different plan where some repartition operators have been eliminated. Note that the join operator requires, for cor- rectness, that no two tuples from R (also from S) share the same (x, y) values but belong to different partitions. This can be achieved by repartitioning both inputs on any non-empty subset of {x, y}. The plan in ...

View in full-text

Context 6

... in Fig. 4c shows a different plan where some repartition operators have been eliminated. Note that the join operator requires, for cor- rectness, that no two tuples from R (also from S) share the same (x, y) values but belong to different partitions. This can be achieved by repartitioning both inputs on any non-empty subset of {x, y}. The plan in Fig. 4c repartitions data on x alone and thus (1) avoids shuffling data from S by leveraging existing partitioning and sorting properties and (2) avoids a subsequent repartitioning on x, required by the reducer oper- ator. It also uses a merge join implementation, which requires the inputs to be sorted by (x, y). This is done by sorting R and ...

View in full-text

Performance of K-means in Hadoop Using MapReduce Programming Model

Conference Paper

Full-text available

Jan 2019

Hadoop which is one of the big data framework uses MapReduce programming model to analyze data. Mahout is a data analysis library that has the ability to use MapReduce programming. One of the clustering algorithms supported by Mahout is K-mean. The researchers are interested in observing the performance speed of applying the K-mean algorithm from M...

Study on the Method of Road Transport Management Information Data Mining based on Pruning Eclat Algorithm and MapReduce

Article

Full-text available

Jul 2014

Road transport management information is a class of massive and correlation data in ITS (intelligent transportation systems), and its association rules data mining has important practical significance. In order to cover the shortage of the classical association rules optimized algorithm Eclat, this paper proposed and demonstrated that candidate set...

Using Hadoop MapReduce in a multicluster environment

Conference Paper

Full-text available

May 2013

Hadoop MapReduce has become one of the most popular tools for data processing. Hadoop is normally installed on a cluster of computers. When the cluster becomes undersized, it can be scaled by adding new computers and storage devices, but it can also be extended by real or virtual resources from another computer cluster. We present a utilization of...

Distinction of Discrete Transformations Applied to Hadoop's MapReduce

Article

Full-text available

Nov 2014

Hadoop MapReduce is an effective data processing platform for both commercial as well as academic applications. It intends the simplification of vast quantities of data as well as ease of processing in parallel on enormous clusters of hardware in a fault-tolerant and dependable approach. There are many modifications possible in the MapReduce to inc...

Fortran Code Refactoring Based on MapReduce Programming Model

Conference Paper

Jul 2023

Survey on performance optimization for database systems

Article

Full-text available

Jan 2023

The performance optimization of database systems has been widely studied for years. From the perspective of the operation and maintenance personnel, it mainly includes three topics: prediction, diagnosis, and tuning. The prediction of future performance can guide the adjustment of configurations and resources. The diagnosis of anomalies can determine the root cause of performance regression. Tuning operations improve performance by adjusting influencing factors, e.g., knobs, indexes, views, resources, and structured query language (SQL) design. In this review, we focus on the performance optimization of database systems and review notable research work on the topics of prediction, diagnosis, and tuning. For prediction, we summarize the techniques, strengths, and limitations of several proposed systems for single and concurrent queries. For diagnosis, we categorize the techniques by the input data, i.e., monitoring metrics, logs, or time metrics, and analyze their abilities. For tuning, we focus on the approaches commonly adopted by the operation and maintenance personnel, i.e., knob tuning, index selection, view materialization, elastic resource, storage management, and SQL antipattern detection. Finally, we discuss some challenges and future work.

Capítulo 8 Redefiniendo Big Data: una propuesta desde la academia

Chapter

Full-text available

Sep 2022

I. Introducción La evolución de la Web, pasando desde un producto totalmente estático de carácter unidireccional para derivar en una plataforma dinámica y bidireccional, dio origen a una infraestructura considerada como una rica fuente en generación de datos, con una gran cantidad de personas de diferentes estratos sociales que participan democráticamente en ella (Leung et al., 2019). En este contexto, el desarrollo de técnicas capaces de procesar estos grandes volúmenes de datos encuentra cabida en diferentes actividades económicas y sociales. En particular, un sector atraído por descubrir qué dicen los datos sería el económico. Con esto, el procesamiento de datos a gran escala se plantea con un enfoque comercial, logrando establecer una ventaja para aquellas empresas que son capaces de conocer en profundidad lo que sus clientes demandan (Neves y Bernardino, 2015). En este orden de ideas, la irrupción del Big Data, en complemento con las actividades que dicen relación con el análisis de estos conjuntos de datos, surge a raíz de la necesidad de almacenar y procesar Petabytes de datos (Zhou et al., 2012). Así, las técnicas desarrolladas bajo el rótulo de Big Data deben considerar una arquitectura escalable, distribuida, de alto rendimiento y tolerante a errores (Neves y Bernardino, 2015). Dado este contexto, el presente artículo pretende situar en el centro de la discusión el futuro que vemos respecto al concepto de Big Data. Para lo anterior, este manuscrito está organizado en tres secciones. La primera de ellas se debe entender como un marco de referencia en torno al concepto de Big Data. En una segunda sección se discute sobre las aplicaciones del Big Data, esto para establecer su transversalidad en diferentes contextos. Finalmente, la última sección presenta las conclusiones de los autores en torno a lo que hoy se entiende por Big Data, la justificación de su aplicación y, cuál sería el futuro para el objeto de estudio que plantea este trabajo.

Fine-Grained Modeling and Optimization for Intelligent Resource Management in Big Data Processing

Preprint

Full-text available

Jul 2022

Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, placement of parallel instances on machines, and resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system to support multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level recommendations in a hierarchical MOO framework. Evaluation using production workloads shows that our new RO system could reduce 37-72% latency and 43-78% cost at the same time, compared to the current optimizer and scheduler, while running in 0.02-0.23s.

Auto-Tag: Tagging-Data-By-Example in Data Lakes

Preprint

Full-text available

Dec 2021

As data lakes become increasingly popular in large enterprises today, there is a growing need to tag or classify data assets (e.g., files and databases) in data lakes with additional metadata (e.g., semantic column-types), as the inferred metadata can enable a range of downstream applications like data governance (e.g., GDPR compliance), and dataset search. Given the sheer size of today's enterprise data lakes with petabytes of data and millions of data assets, it is imperative that data assets can be ``auto-tagged'', using lightweight inference algorithms and minimal user input. In this work, we develop Auto-Tag, a corpus-driven approach that automates data-tagging of \textit{custom} data types in enterprise data lakes. Using Auto-Tag, users only need to provide \textit{one} example column to demonstrate the desired data-type to tag. Leveraging an index structure built offline using a lightweight scan of the data lake, which is analogous to pre-training in machine learning, Auto-Tag can infer suitable data patterns to best ``describe'' the underlying ``domain'' of the given column at an interactive speed, which can then be used to tag additional data of the same ``type'' in data lakes. The Auto-Tag approach can adapt to custom data-types, and is shown to be both accurate and efficient. Part of Auto-Tag ships as a ``custom-classification'' feature in a cloud-based data governance and catalog solution \textit{Azure Purview}.

Parallel Query Processing in a Polystore

Article

Full-text available

Dec 2021
DISTRIB PARALLEL DAT

The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language that allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing for native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.

Phoebe: A Learning-based Checkpoint Optimizer

Preprint

Full-text available

Oct 2021

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems, hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates being examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans, and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimum performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.

Data-induced predicates for sideways information passing in query optimizers

Article

Full-text available

Aug 2021
VLDB J

Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now apply beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and have no overhead during query execution. We study how to apply diPs for complex query expressions and how the usefulness of diPs varies with the data statistics used to construct diPs and the data distributions. Our results show that building diPs using zone-maps which are already maintained in today’s clusters leads to sizable data skipping gains. Using a new (slightly larger) statistic, 50% of the queries in the TPC-H, TPC-DS and JoinOrder benchmarks can skip at least 33% of the query input. Consequently, the median query in a production big-data cluster finishes roughly 2×\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2\times $$\end{document} faster.

面向实时视频流分析的边缘计算技术

Article

Full-text available

Jul 2021

实时视频流分析在智能监控、智慧城市、自动驾驶等场景中具有重要价值. 然而计算负载高、带宽需求大、延迟要求严等特点使得实时视频流分析难以通过传统的云计算范式进行部署. 近年来兴起的边缘计算范式, 将计算任务从云端下沉到位于网络边缘的终端设备和边缘服务器上, 能够有效解决上述问题. 因此, 许多针对实时视频流分析的边缘计算研究逐渐涌现. 本文首先介绍了智能视频流分析和边缘计算的背景知识, 以及二者结合的典型应用场景; 接着提出了现有系统所关注的衡量指标和面临的挑战; 然后从终端设备层次、协作层次、边缘/云层次对本领域的关键技术分别进行了详细的介绍, 重点涉及了模型压缩和选择、本地缓存、视频帧过滤、任务卸载、网络协议、隐私保护、查询优化、推理加速和边缘缓存技术. 基于对上述各项核心技术的有机整合, 本文提出了基于边缘计算的视频大数据智能分析平台 Argus, 从数据采集、推理分析, 到数据挖掘、日志管理, 对实时视频流分析全生命周期提供支持, 并成功应用到智慧油田中. 最后, 本文讨论了本领域尚待解决的问题和未来研究方向, 希望为今后的研究工作提供有益参考.

KEA: Tuning an Exabyte-Scale Data Infrastructure

Preprint

Full-text available

Jun 2021

Microsoft's internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware/software/workloads this manual tuning approach had reached its limit -- we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our cluster dynamic behavior in a set of machine learning (ML) models based on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership in critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine "observational" tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of "flighting" (i.e., conservative testing in production). This allows us to support a broad range of applications that we discuss in this paper. KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. At the best of our knowledge, this paper is the first to discuss research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure.

Input and output trees during query optimization. a Logical input tree, b non-optimal physical tree, c optimal physical tree

Contexts in source publication

Similar publications

Citations