Fig. 4 Input and output trees during query optimization: a logical input tree, b non-optimal physical tree, c optimal physical tree

Source publication
Article
Full-text available
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets, such as search logs, click streams, and web graph data. For cost and performance reasons, processing is typically done on large clusters of tens of thousands of commodity machines. Such massive data analysis on large clusters presents new opp...

Contexts in source publication

Context 1
... TO "S.ss" CLUSTERED BY x SORTED BY x,y; Figure 4a shows a tree of logical operators that specifies, in an almost one-to-one correspondence, the relational algebra representation of the script above. Figure 4b, c show two different physical operator trees corresponding to alternative execution plans for the same script. ...
Context 2
... illustrate the importance of script optimization by contrasting Fig. 4b, c. In the straightforward execution plan in Fig. 4b, table R is first hash-partitioned on columns {R.x, R.y}. This step is done in parallel, with each machine processing its portion of the table. Similarly, table S is partitioned on {S.x, S.y}. Next, rows from matching R and S partitions are hash-joined in parallel, producing a result partitioned on {R.x, R.y} (or {S.x, S.y}). In ...
Context 3
... data, data reshuffling through the network can pose a serious performance bottleneck. The alternative plan in Fig. 4c shows a different plan where some repartition operators have been eliminated. Note that the join operator requires, for correctness, that no two tuples from R (also from S) share the same (x, y) values but belong to different partitions. This can be achieved by repartitioning both inputs on any non-empty subset of {x, y}. The plan in Fig. 4c repartitions data on x alone and thus (1) avoids shuffling data from S by leveraging existing partitioning and sorting properties and (2) avoids a subsequent repartitioning on x, required by the reducer operator. It also uses a merge join implementation, which requires the inputs to be sorted by (x, y). This is done by sorting R and ...
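To make the correctness argument in Context 3 concrete, here is a minimal Python sketch (not from the paper; table contents and partition count are made up): two tuples that agree on (x, y) necessarily agree on x alone, so hash-partitioning both inputs on x co-locates every matching pair and the (x, y) join can run partition-by-partition without reshuffling.

```python
# Minimal sketch: repartitioning on a non-empty subset of the join keys
# {x, y} (here, x alone) still co-locates all matching (x, y) pairs.
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    """Assign each row to a partition by hashing the given key columns."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(tuple(row[k] for k in key)) % n_parts].append(row)
    return parts

R = [{"x": 1, "y": 10, "a": "r1"}, {"x": 2, "y": 20, "a": "r2"}]
S = [{"x": 1, "y": 10, "b": "s1"}, {"x": 2, "y": 99, "b": "s2"}]

# Partition both inputs on x alone, as in the plan of Fig. 4c.
R_parts = hash_partition(R, ["x"], n_parts=4)
S_parts = hash_partition(S, ["x"], n_parts=4)

# Any R-row and S-row with equal (x, y) now share a partition, so the
# (x, y) join proceeds partition-by-partition with no further shuffle.
for p in range(4):
    for r in R_parts.get(p, []):
        for s in S_parts.get(p, []):
            if (r["x"], r["y"]) == (s["x"], s["y"]):
                print("joined in partition", p, r["a"], s["b"])
```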

Similar publications

Conference Paper
Full-text available
Hadoop, one of the big data frameworks, uses the MapReduce programming model to analyze data. Mahout is a data analysis library that is able to use MapReduce programming. One of the clustering algorithms supported by Mahout is K-means. The researchers are interested in observing the performance speed of applying the K-means algorithm from M...
Article
Full-text available
Road transport management information is a class of massive, correlated data in ITS (intelligent transportation systems), and mining its association rules has important practical significance. To address the shortcomings of the classical association-rule optimization algorithm Eclat, this paper proposes and demonstrates that candidate set...
Conference Paper
Full-text available
Hadoop MapReduce has become one of the most popular tools for data processing. Hadoop is normally installed on a cluster of computers. When the cluster becomes undersized, it can be scaled by adding new computers and storage devices, but it can also be extended by real or virtual resources from another computer cluster. We present a utilization of...
Article
Full-text available
Hadoop MapReduce is an effective data processing platform for both commercial and academic applications. It aims to simplify the processing of vast quantities of data in parallel on enormous clusters of hardware in a fault-tolerant and dependable manner. There are many modifications possible in MapReduce to inc...

Citations

... Existing research has built converters for several SQL declarative languages and integrated MapReduce to support these languages, including Pig Latin/Pig [11][12], SCOPE [13][14], HadoopDB [15], Hive [16], YSmart [17], and Jaql [18]. At present, some scholars have proposed methods and tools for refactoring programming languages into MapReduce code. ...
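As a hedged illustration of the kind of translation these converters perform, the sketch below renders a simple GROUP BY/COUNT query as map and reduce functions in plain Python. It mirrors the general pattern only; it is not the code generated by Pig, Hive, SCOPE, or any other cited system.

```python
# Illustrative only: one way a converter might render
#   SELECT x, COUNT(*) FROM R GROUP BY x
# as a map function and a reduce function.
from itertools import groupby
from operator import itemgetter

def map_fn(row):
    yield (row["x"], 1)                 # emit (group key, partial count)

def reduce_fn(key, values):
    yield (key, sum(values))            # aggregate the partial counts

def run_mapreduce(rows):
    # Sorting stands in for the framework's shuffle/sort phase.
    pairs = sorted((kv for row in rows for kv in map_fn(row)),
                   key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield from reduce_fn(key, [v for _, v in group])

print(list(run_mapreduce([{"x": 1}, {"x": 1}, {"x": 2}])))  # [(1, 2), (2, 1)]
```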
... CLOUDVIEWS [72]. CLOUDVIEWS is a computation reuse framework for Microsoft's SCOPE job service [112,113]. It supports online workloads with a feedback loop for runtime statistics, and focuses on recurring workloads. ...
Article
Full-text available
The performance optimization of database systems has been widely studied for years. From the perspective of the operation and maintenance personnel, it mainly includes three topics: prediction, diagnosis, and tuning. The prediction of future performance can guide the adjustment of configurations and resources. The diagnosis of anomalies can determine the root cause of performance regression. Tuning operations improve performance by adjusting influencing factors, e.g., knobs, indexes, views, resources, and structured query language (SQL) design. In this review, we focus on the performance optimization of database systems and review notable research work on the topics of prediction, diagnosis, and tuning. For prediction, we summarize the techniques, strengths, and limitations of several proposed systems for single and concurrent queries. For diagnosis, we categorize the techniques by the input data, i.e., monitoring metrics, logs, or time metrics, and analyze their abilities. For tuning, we focus on the approaches commonly adopted by the operation and maintenance personnel, i.e., knob tuning, index selection, view materialization, elastic resource, storage management, and SQL antipattern detection. Finally, we discuss some challenges and future work.
... In this vein, the emergence of Big Data, together with the activities involved in analyzing these data sets, arose from the need to store and process petabytes of data (Zhou et al., 2012). Accordingly, the techniques developed under the Big Data label must rely on a scalable, distributed, high-performance, fault-tolerant architecture (Neves and Bernardino, 2015). ...
Chapter
Full-text available
I. Introduction. The evolution of the Web, from a fully static, one-way product into a dynamic, two-way platform, gave rise to an infrastructure regarded as a rich source of data generation, with a great number of people from different social strata participating in it democratically (Leung et al., 2019). In this context, the development of techniques capable of processing these large volumes of data finds a place in different economic and social activities. In particular, one sector drawn to discovering what the data say is the economic one. Accordingly, large-scale data processing is approached from a commercial standpoint, establishing an advantage for those companies that are able to understand in depth what their customers demand (Neves and Bernardino, 2015). In this vein, the emergence of Big Data, together with the activities involved in analyzing these data sets, arose from the need to store and process petabytes of data (Zhou et al., 2012). Thus, the techniques developed under the Big Data label must rely on a scalable, distributed, high-performance, fault-tolerant architecture (Neves and Bernardino, 2015). Against this backdrop, this article aims to place at the center of the discussion the future we foresee for the concept of Big Data. To that end, the manuscript is organized into three sections. The first should be understood as a frame of reference for the concept of Big Data. The second section discusses applications of Big Data, establishing its cross-cutting relevance in different contexts. Finally, the last section presents the authors' conclusions about what Big Data is understood to mean today, the rationale for its application, and the future of the object of study addressed in this work.
... While big data query processing has become commonplace in enterprise businesses and many platforms have been developed for this purpose [3,5,7,9,12,26,29,42,49,57,61,65,66], resource optimization in large clusters [14,34,36,37] has received less attention. However, we observe from real-world experiences of running large compute clusters in the Alibaba Cloud that resource management plays a vital role in meeting both performance goals and budgetary constraints of internal and external analytical users. ...
... It further improves accuracy by using profiling features such as data-sharing, data-conflict and resource competition from the local DBMS. In our work, however, concurrency information regarding the operators from different queries is unavailable due to the container technology in large clusters [26,49,61,66]. ...
Preprint
Full-text available
Big data processing at the production scale presents a highly complex environment for resource optimization (RO), a problem crucial for meeting performance goals and budgetary constraints of analytical users. The RO problem is challenging because it involves a set of decisions (the partition count, placement of parallel instances on machines, and resource allocation to each instance), requires multi-objective optimization (MOO), and is compounded by the scale and complexity of big data systems while having to meet stringent time constraints for scheduling. This paper presents a MaxCompute-based integrated system to support multi-objective resource optimization via fine-grained instance-level modeling and optimization. We propose a new architecture that breaks RO into a series of simpler problems, new fine-grained predictive models, and novel optimization methods that exploit these models to make effective instance-level recommendations in a hierarchical MOO framework. Evaluation using production workloads shows that our new RO system could reduce 37-72% latency and 43-78% cost at the same time, compared to the current optimizer and scheduler, while running in 0.02-0.23s.
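A minimal sketch of the multi-objective core this abstract describes: selecting the Pareto-optimal candidates over two objectives. The candidate tuples and the choice of (latency, cost) as the only objectives are assumptions for illustration; the paper's actual instance-level models and hierarchical MOO framework are far richer.

```python
# Illustrative MOO kernel: keep only Pareto-optimal (latency, cost)
# candidates, where lower is better on both objectives.
def pareto_front(candidates):
    """candidates: list of distinct (latency, cost) pairs."""
    front = []
    for c in candidates:
        dominated = any(o[0] <= c[0] and o[1] <= c[1] and o != c
                        for o in candidates)
        if not dominated:
            front.append(c)
    return front

# Hypothetical per-configuration predictions (seconds, dollars).
configs = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (9.0, 9.0)]
print(pareto_front(configs))  # (9.0, 9.0) drops: dominated by (8.0, 7.0)
```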
... We implement our offline indexing algorithm in a Map-Reduce-like language called Scope [63] with UDFs in C#, executed on a production cluster [19]. ...
... Data set. We evaluate algorithms using a real corpus T with 7.2M data columns, crawled from a production data lake at Microsoft [63]. ...
Preprint
Full-text available
As data lakes become increasingly popular in large enterprises today, there is a growing need to tag or classify data assets (e.g., files and databases) in data lakes with additional metadata (e.g., semantic column-types), as the inferred metadata can enable a range of downstream applications like data governance (e.g., GDPR compliance) and dataset search. Given the sheer size of today's enterprise data lakes, with petabytes of data and millions of data assets, it is imperative that data assets can be "auto-tagged" using lightweight inference algorithms and minimal user input. In this work, we develop Auto-Tag, a corpus-driven approach that automates data-tagging of custom data types in enterprise data lakes. Using Auto-Tag, users only need to provide one example column to demonstrate the desired data type to tag. Leveraging an index structure built offline using a lightweight scan of the data lake, which is analogous to pre-training in machine learning, Auto-Tag can infer suitable data patterns to best "describe" the underlying "domain" of the given column at interactive speed, which can then be used to tag additional data of the same "type" in data lakes. The Auto-Tag approach can adapt to custom data types, and is shown to be both accurate and efficient. Part of Auto-Tag ships as a "custom-classification" feature in the cloud-based data governance and catalog solution Azure Purview.
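A toy sketch of the corpus-driven idea, under heavy assumptions: generalize the one example column's values into coarse character-class patterns, then tag any corpus column whose values mostly fit those patterns. The pattern language and matching rule here are invented for illustration and are not Auto-Tag's actual inference algorithm.

```python
# Toy sketch of corpus-driven tagging from a single example column.
# The pattern language ('L' = letter, 'D' = digit) is invented.
import re

def generalize(value):
    """Map a string to a coarse pattern, e.g. 'AB-12' -> 'LL-DD'."""
    return re.sub(r"[A-Za-z]", "L", re.sub(r"[0-9]", "D", value))

def infer_domain(example_column):
    return {generalize(v) for v in example_column}

def tag_matches(corpus_columns, domain, threshold=0.9):
    """Names of columns whose values mostly fit the inferred patterns."""
    return [name for name, values in corpus_columns.items()
            if sum(generalize(v) in domain for v in values) / len(values)
            >= threshold]

example = ["US-001", "DE-042", "FR-007"]           # one user-provided column
corpus = {"country_codes": ["JP-013", "BR-104"],   # hypothetical data lake
          "emails": ["a@b.com", "c@d.org"]}
print(tag_matches(corpus, infer_domain(example)))  # ['country_codes']
```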
... In addition, it allows implementing custom extractors, processors and reducers and combining operators for manipulating rowsets. SCOPE has been extended to combine SQL and MapReduce operators in a single language [24]. These systems are used over a single distributed storage system and therefore do not address the problem of integrating a number of diverse data stores. ...
Article
Full-text available
The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store’s native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language that allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration and (ii) incorporating the approach within the LeanXcale distributed query engine, thus allowing for native scripts to be processed in parallel at data store shards. In addition, (iii) efficient optimization techniques, such as bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.
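Bind join, mentioned above, is a standard distributed-join optimization: evaluate the selective side first, then push its join-key values as a filter into the query sent to the other data store, so that store returns only matching rows. The schematic Python sketch below uses invented in-memory "stores"; it is not LeanXcale or CloudMdsQL code.

```python
# Schematic bind join (the general technique, not a specific engine):
# bind the left side's join keys, push them as a filter to the right store.
def bind_join(left_rows, right_store_query, key):
    keys = {row[key] for row in left_rows}       # bind values from the left
    # Only rows matching the bound keys are retrieved from the right store,
    # instead of scanning it fully and joining afterwards.
    right_rows = right_store_query(keys)
    by_key = {}
    for r in right_rows:
        by_key.setdefault(r[key], []).append(r)
    return [l | r for l in left_rows for r in by_key.get(l[key], [])]

# Hypothetical "remote store" that accepts a pushed-down key filter.
store = [{"id": i, "payload": f"p{i}"} for i in range(1_000)]
query = lambda keys: [r for r in store if r["id"] in keys]

selective_left = [{"id": 7, "tag": "a"}, {"id": 42, "tag": "b"}]
print(bind_join(selective_left, query, "id"))
```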
... It consists of hundreds of thousands of machines executing hundreds of thousands of jobs per day [42,45]. Cosmos users submit their analytical jobs using SCOPE [9,62], a SQL-like data flow dialect. SCOPE jobs are compiled into a directed acyclic graph (DAG) of stages, which in turn are executed in parallel by a YARN-based scheduler [13]. ...
Preprint
Full-text available
Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time, taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe on production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimal performance impact. Phoebe also shows that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.
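To convey the intuition behind checkpoint selection, here is a greedy sketch under simplifying assumptions: persist the stage outputs that save the most recomputation time per byte of durable storage, within a storage budget. Phoebe's actual formulation is an integer program over learned per-stage predictions; the stage data below is hypothetical.

```python
# Greedy sketch of checkpoint selection (intuition only; Phoebe solves
# an integer program over learned predictions, not this heuristic).
def choose_checkpoints(stages, budget_bytes):
    # Each stage: (name, predicted_output_bytes, recompute_seconds_saved).
    ranked = sorted(stages, key=lambda s: s[2] / s[1], reverse=True)
    chosen, used = [], 0
    for name, size, saved in ranked:
        if used + size <= budget_bytes:   # respect the durable-storage budget
            chosen.append(name)
            used += size
    return chosen

stages = [("extract", 50e9, 1200),  # hypothetical predictions
          ("join", 200e9, 4000),
          ("agg", 5e9, 900)]
print(choose_checkpoints(stages, budget_bytes=100e9))  # ['agg', 'extract']
```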
... On the right, we show the plan generated using magic-set transformations, which pull the group-by above the join. diPs complement magic-set transformations; we see here that magic-set transformations cannot skip partitions of lineitem, but because the group-by has been pulled above the join, moving diPs sideways once is enough, unlike the case in Fig. 2 (left) [92]. Recall that the predicate columns are only available in the part table. ...
... Bloom filters record set membership [42]. However, we find them to be less useful here because the partition sizes used in practical distributed storage systems (e.g., ~100 MB of data [43,92]) result in millions of distinct values per column in each partition, especially when join columns are keys. To record large sets, Bloom filters require large space or they will have a high false-positive rate; e.g., a 1 KB Bloom filter that records a million distinct values will have 99.62% false positives [42], leading to almost no data skipping. ...
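The false-positive arithmetic can be checked with the textbook formula p ≈ (1 − e^(−kn/m))^k. The quoted 99.62% presumably reflects the cited paper's particular parameterization; with the textbook formula, an 8192-bit (1 KB) filter holding a million values yields p essentially equal to 1 for any hash count, the same qualitative conclusion of almost no skipping.

```python
# Checking the order of magnitude with the standard Bloom-filter formula
# p ~ (1 - e^(-k*n/m))^k, where m = bits, n = inserted values, k = hashes.
import math

def bloom_fpr(m_bits, n_values, k_hashes):
    return (1 - math.exp(-k_hashes * n_values / m_bits)) ** k_hashes

m, n = 8 * 1024, 1_000_000      # a 1 KB filter, one million distinct values
for k in (1, 2, 4, 8):
    print(k, bloom_fpr(m, n, k))  # ~1.0 for every k: almost no skipping
```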
... To evaluate behavior more broadly, we generate several other layouts where each table is ordered on a randomly chosen column. For each data layout, we partition the data as recommended by the storage system, i.e., roughly 100 MB of content in SCOPE clusters [48,92] and roughly 1M rows per columnstore segment in SQL Server [8]. ...
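Zone maps, the statistic diPs build on, admit a very small sketch: a predicate on the dimension table induces a value range on the join column, and a fact-table partition is read only if its per-partition [min, max] intersects that range. The layout and ranges below are hypothetical.

```python
# Minimal zone-map skipping: keep a partition only if its [min, max]
# range on the join column intersects the range induced by the diP.
def skip_partitions(zone_maps, dip_lo, dip_hi):
    """zone_maps: list of (partition_id, col_min, col_max)."""
    return [pid for pid, lo, hi in zone_maps
            if not (hi < dip_lo or lo > dip_hi)]

# Zone maps for four ~100 MB partitions of a fact table ordered on the column:
zone_maps = [(0, 1, 250), (1, 251, 500), (2, 501, 750), (3, 751, 1000)]

# Suppose the predicate on the joining table induces join keys in [600, 640]:
print(skip_partitions(zone_maps, 600, 640))  # [2] -- 3 of 4 partitions skipped
```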
Article
Full-text available
Using data statistics, we convert predicates on a table into data-induced predicates (diPs) that apply on the joining tables. Doing so substantially speeds up multi-relation queries because the benefits of predicate pushdown can now apply beyond just the tables that have predicates. We use diPs to skip data exclusively during query optimization; i.e., diPs lead to better plans and have no overhead during query execution. We study how to apply diPs for complex query expressions and how the usefulness of diPs varies with the data statistics used to construct diPs and the data distributions. Our results show that building diPs using zone-maps, which are already maintained in today's clusters, leads to sizable data-skipping gains. Using a new (slightly larger) statistic, 50% of the queries in the TPC-H, TPC-DS and JoinOrder benchmarks can skip at least 33% of the query input. Consequently, the median query in a production big-data cluster finishes roughly 2× faster.
... Optasia [141] extends the dataflow processing engine SCOPE [171] to support a variety of basic video-analytics modules. Specifically, Optasia integrates single-frame operations such as feature extraction, classification, and detection into SCOPE as processor operators; continuous video-stream operations such as background subtraction and object tracking as reducer operators; and cross-stream operations such as object re-identification as combiner operators, exposing a structured query language (SQL)-like interface. ...
Article
Full-text available
Real-time video stream analytics is of great value in scenarios such as intelligent surveillance, smart cities, and autonomous driving. However, its high computational load, large bandwidth demand, and strict latency requirements make it hard to deploy under the traditional cloud computing paradigm. The recently emerging edge computing paradigm, which pushes computation from the cloud down to end devices and edge servers at the network edge, can effectively address these problems. Consequently, much research on edge computing for real-time video stream analytics has emerged. This paper first introduces the background of intelligent video analytics and edge computing, along with typical application scenarios combining the two; it then presents the metrics that existing systems focus on and the challenges they face; it then surveys the key techniques in this field at the device level, the collaboration level, and the edge/cloud level, covering model compression and selection, local caching, video frame filtering, task offloading, network protocols, privacy protection, query optimization, inference acceleration, and edge caching. Building on an integration of these core techniques, we present Argus, an edge-computing-based platform for intelligent analysis of video big data that supports the full life cycle of real-time video stream analytics, from data collection and inference to data mining and log management, and that has been successfully applied in smart oil fields. Finally, we discuss open problems and future research directions in this field, in the hope of providing a useful reference for future work.
... Table 1 summarizes several scalability metrics of this massive infrastructure. The vast majority of the submitted jobs are written in Scope [59], a SQL-like dialect (with heavy use of C# and Python UDFs). Scope jobs are translated to a DAG of operators that are spread for execution across several machines. ...
Preprint
Full-text available
Microsoft's internal big-data infrastructure is one of the largest in the world -- with over 300k machines running billions of tasks from over 0.6M daily jobs. Operating this infrastructure is a costly and complex endeavor, and efficiency is paramount. In fact, for over 15 years, a dedicated engineering team has tuned almost every aspect of this infrastructure, achieving state-of-the-art efficiency (>60% average CPU utilization across all clusters). Despite rich telemetry and strong expertise, faced with evolving hardware/software/workloads this manual tuning approach had reached its limit -- we had plateaued. In this paper, we present KEA, a multi-year effort to automate our tuning processes to be fully data/model-driven. KEA leverages a mix of domain knowledge and principled data science to capture the essence of our cluster dynamic behavior in a set of machine learning (ML) models based on collected system data. These models power automated optimization procedures for parameter tuning, and inform our leadership in critical decisions around engineering and capacity management (such as hardware and data center design, software investments, etc.). We combine "observational" tuning (i.e., using models to predict system behavior without direct experimentation) with judicious use of "flighting" (i.e., conservative testing in production). This allows us to support a broad range of applications that we discuss in this paper. KEA continuously tunes our cluster configurations and is on track to save Microsoft tens of millions of dollars per year. To the best of our knowledge, this paper is the first to discuss research challenges and practical learnings that emerge when tuning an exabyte-scale data infrastructure.