Independents vs dependents.

Source publication

Managing Cloud Infrastructures by a Multi-Layer Data Analytics

Conference Paper

Full-text available

Jul 2016

Today's IT management faces the problem of " virtualized big environments " with hundreds of thousands of objects/resources as virtual machines, hosts, clusters, etc., evolving into cloud services. Admins of those infrastructures heavily rely on smart data-agnostic approaches to get reliable and accurate information regarding any current or upcomin...

Context 1

... 1 and 2 illustrate the volumes of the categories we got for this environment. Fig. 1 shows the percentage of constants, dependents, and independents against all metrics, and Fig. 2 shows the percentages of dependents and independents against variable ...

View in full-text

cCube: A Cloud Microservices Architecture for Evolutionary Machine Learning Classification

Conference Paper

Full-text available

Jul 2017

We present cCube, a microservices open source architecture used to automatically create an application of one or more Evolutionary Machine Learning (EML) classification algorithms that can be deployed to the cloud with automatic data factorization, training, result filtering and fusion.

A Comparative Taxonomy and Survey of Public Cloud Infrastructure Vendors

Article

Full-text available

Oct 2017

An increasing number of technology enterprises are adopting cloud-native architectures to offer their web-based products, by moving away from privately-owned data-centers and relying exclusively on cloud service providers. As a result, cloud vendors have lately increased, along with the estimated annual revenue they share. However, in the process o...

Privacy-preserving Machine Learning in Cloud

Conference Paper

Full-text available

Nov 2017

Machine learning algorithms based on deep neural networks (NN) have achieved remarkable results and are being extensively used in different domains. On the other hand, with increasing growth of cloud services, several Machine Learning as a Service (MLaaS) are offered where training and deploying machine learning models are performed on cloud provid...

Speculation resource provisioning in high-performance computing

Article

Full-text available

Jan 2017

Distributed computing gives a backing to buyers to diminish their inner foundation, and for suppliers to expand incomes, utilizing their own particular framework. The proper load balancing and dynamic resource provisioning improves cloud performance and attracts the cloud users. In this paper, we propose an automated resource provisioning algorithm...

Fig. 1. Comparing the error rate of the models generated using...

Fig. 2. Running time for different batch sizes.

Table 3 . Training Neural Network over Encrypted Data (batch input size...

Privacy-preserving Machine Learning as a Service

Article

Full-text available

Jun 2018

Machine learning algorithms based on deep Neural Networks (NN) have achieved remarkable results and are being extensively used in different domains. On the other hand, with increasing growth of cloud services, several Machine Learning as a Service (MLaaS) are offered where training and deploying machine learning models are performed on cloud provid...

Diagnosis-Effective Sampling of Application Traces

Preprint

Full-text available

Apr 2024

Distributed tracing is a cutting-edge technology for monitoring, managing, and troubleshooting native cloud applications. It offers a more comprehensive and continuous observability, surpassing traditional logging methods, and is indispensable for navigating modern complex software architectures. However, the sheer volume of generated traces is staggering in distributed applications, and direct storage and utilization of every trace are impractical due to associated operational costs. This entails a sampling strategy to select which traces warrant storage and analysis. Historically, sampling methods have included a rate-based approach, often relying heavily on a manual configuration. There is a need for a more intelligent approach, and we propose a hierarchical sampling methodology to address multiple requirements concurrently. Initial rate-based sampling mitigates the overwhelming volume of traces, as no further analysis can be performed on this level. In the next stage, more nuanced analysis is facilitated based on the previous foundation, incorporating information regarding trace properties and ensuring the preservation of vital process details even under extreme conditions. This comprehensive approach not only aids in the visualization and conceptualization of applications but also enables more targeted analysis in later stages. As we delve deeper into the sampling hierarchy, the technique becomes tailored to specific purposes, such as simplifying application troubleshooting. In this context, the sampling strategy prioritizes the retention of erroneous traces from dominant processes, thus facilitating the identification and resolution of underlying issues. The focus of this paper is to reveal the impact of the sampling on troubleshooting efficiency. Leveraging intelligent and explainable artificial intelligence solutions enables the detection of malfunctioning microservices and provides transparent insights into root causes. We advocate for using rule-induction systems, which offer explainability and efficacy in decision-making processes. By integrating advanced sampling techniques with machine learning-driven intelligence, we empower organizations to navigate the complexities of large-scale distributed cloud environments effectively.

A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification

Preprint

Full-text available

Dec 2023

Remediation of IT issues encoded into domain-specific or user-defined alerts occurring in cloud environments and customer ecosystems in a vast majority of cases suffers from accurate recommendations which could be timely supplied for recovery of performance degradations. That, of course, is hard to realize through furnishing those abnormality definitions with an appropriate expert knowledge which varies from one environment to another. At the same time, in a large proportion of support cases, the reported problems under Global Support Services (GSS) or Site Reliability Engineering (SRE) treatment ultimately go down to the product teams making them waste costly development hours on investigating self-monitoring metrics of our solutions. Therefore, the mean-time-to-resolution (MTTR) rates of problems/alerts are significantly impacted from lack of a systematic approach towards adopting AI Ops. That would imply building, maintaining, and continuously improving/annotating a data store of insights that ML models are trained and generalized across the whole customer base and corporate cloud services. Our ongoing study is in line with such a vision and validates an approach that learns the alert resolution patterns in such a global setting and explains them using interpretable AI methodologies. The knowledge store of causative rules is then applied in predicting potential sources of the application degradation reflected in an active alert instance. In this communication, we share our experiences with a prototype solution and up to date analysis demonstrating how root conditions are discovered with a high accuracy for a specific type of problem. It is validated against the historical data of resolutions performed by heavy manual development efforts. We also offer a Dempster-Shafer theory-based rule verification framework for experts as a what-if analysis tool to test their hypotheses about underlying environment.

An Enterprise Time Series Forecasting System for Cloud Applications Using Transfer Learning

Article

Full-text available

Feb 2021
SENSORS-BASEL

The main purpose of an application performance monitoring/management (APM) software is to ensure the highest availability, efficiency and security of applications. An APM software accomplishes the main goals through automation, measurements, analysis and diagnostics. Gartner specifies the three crucial capabilities of APM softwares. The first is an end-user experience monitoring for revealing the interactions of users with application and infrastructure components. The second is application discovery, diagnostics and tracing. The third key component is machine learning (ML) and artificial intelligence (AI) powered data analytics for predictions, anomaly detection, event correlations and root cause analysis. Time series metrics, logs and traces are the three pillars of observability and the valuable source of information for IT operations. Accurate, scalable and robust time series forecasting and anomaly detection are the requested capabilities of the analytics. Approaches based on neural networks (NN) and deep learning gain an increasing popularity due to their flexibility and ability to tackle complex nonlinear problems. However, some of the disadvantages of NN-based models for distributed cloud applications mitigate expectations and require specific approaches. We demonstrate how NN-models, pretrained on a global time series database, can be applied to customer specific data using transfer learning. In general, NN-models adequately operate only on stationary time series. Application to nonstationary time series requires multilayer data processing including hypothesis testing for data categorization, category specific transformations into stationary data, forecasting and backward transformations. We present the mathematical background of this approach and discuss experimental results based on implementation for Wavefront by VMware (an APM software) while monitoring real customer cloud environments.

An Enterprise Time Series Forecasting System for Cloud Applications Using Transfer Learning

Preprint

Full-text available

Jan 2021

One of the key components of application performance monitoring (APM) software is 2 AI/ML empowered data analytics for predictions, anomaly detection, event correlations and root 3 cause analysis. Time series metrics, logs and traces are three pillars of observability and the valuable 4 source of information for IT operations. Accurate, scalable and robust time series forecasting and 5 anomaly detection are desirable capabilities of the analytics. Approaches based on neural networks 6 (NN) and deep learning gain increasing popularity due to their flexibility and ability to tackle complex 7 non-linear problems. However, some of the disadvantages of NN-based models for distributed cloud 8 applications mitigate expectations and require specific approaches. We demonstrate how NN-models 9 pretrained on a global time series database can be applied to customer specific data using transfer 10 learning. In general, NN-models adequately operate only on stationary time series. Application 11 to non-stationary time series requires multilayer data processing including hypothesis testing for 12 data categorization, category specific transformations into stationary data, forecasting and backward 13 transformations. We present the mathematical background of this approach and discuss experimental 14 results from the productized implementation in Wavefront by VMware (an APM software) while 15 monitoring real customer cloud environments.

Estimating Efficient Sampling Rates of Metrics for Training Accurate Machine Learning Models

Conference Paper

Full-text available

Sep 2020

Cloud management solutions provide full real-time visibility into modern software-defined data centers (SDDC) of high complexity and sophistication through measuring millions of indicators with increasingly high sampling rate. This high frequency monitoring of metrics allows capturing the expected ever-growing dynamism of business-critical applications resulting in huge bases of time series data to be stored for analysis, pattern detection, and training predictive/forecasting models. That causes high analytics overhead and product performance issues. Therefore, identifying optimal sampling rates of time series data subject to preserving their main information content could mitigate this issue. A particular use case is tuning the sampling rates to be efficient for training ML models accurate enough in analytics tasks, such as anomaly detection. In this paper, we analyze a large collection of cloud application metrics and show that the sampling rate can be substantially reduced with a small information divergence. Moreover, we show that those anomaly detection modules perform sufficiently/tolerably accurate for the reduced data sets.

Fingerprinting Data Center Problems with Association Rules

Conference Paper

Full-text available

Sep 2020

Cloud management technologies increasingly automate different aspects of data center administration, where the final goal is to make self-driving solutions. Learning fingerprints of KPI-or SLO-impacting performance problems in IT infrastructures is a relevant task towards such a vision. Instead of defining problem types for data center components (resources/objects of various kinds) using domain knowledge, which is hard to obtain and unreliable because of complexities and sophistication of modern cloud systems, we propose a ML framework to detect those issue categories. Then alerting engines can run on top of those patterns to notify the users on conditions that are impacting system's KPIs thus providing explainability for troubleshooting and long-term performance optimization of the infrastructure. We consider several scenarios for learning problem definitions in terms of constructs by vRealize Operations-one of the leading solutions in the cloud management market. Using association rules mining concepts we can recommend problem patterns (fingerprints) in form of minimum size attribute combinations that constitute core structures highly associated with degradation of the KPI or SLO loss. We demonstrate experimental insights on virtualized environments applying our prototype algorithm.

Intelligent Troubleshooting in Data Centers with Mining Evidence of Performance Problems

Conference Paper

Full-text available

Sep 2020

Identifying actual root causes of a performance issue within a modern cloud infrastructure with high level of scale, sophistication, and complexity, is a hard task. It is especially complicated to diagnose a service or infrastructure degradation of an unknown nature, when no active alert is enough indicative about potential sources (be it an object, its metric, property, or an associated event) of the problem. In such a situation, the data center administration is intuitively looking for changes in the system that might reveal the causative factors. This requires costly investigations and results in business-critical losses. Cloud management vendors are building visions around AI Ops-enabled automation of the entire workflow of root cause analysis and troubleshooting. We propose a solution towards such a vision which is based on hypothesis testing and machine learning approaches for automatic mining "important changes" of various kinds in behavior of data center objects across time and infrastructure topology. Those are the most relevant evidence patterns expected to explain the performance issue. Our current implementation which is integrated into vRealize Operations runs on the available three sorts of monitoring data-metrics, properties, and events. However , the full vision is to extensively include more observability provided by other cloud management tools vertically scaled to capture the depth of a specific dimension of the data center administration. The implemented module produces lists of recommended patterns across those three dimensions rank ordered subject to different criteria for each, such as confidence (p-value) provided by hypothesis testing and magnitude of change in the metric data, event's sentiment score or abnormality degree, unexpectedness/entropy of property variations, etc. We describe the main analytical concepts behind the solution and demonstrate its validation in an application troubleshooting scenario.

Identifying Changed or Sick Resources from Logs

Conference Paper

Full-text available

Sep 2018

The identification of important changes in a complex distributed system is a challenging data science problem. Solving this problem is critical for tools for managing modern cloud infrastructure stacks and other large complex distributed systems. In this paper, we investigate two specific approaches to using log data to solve this problem. The first approach is comparing a source’s current and past behavior. Some solutions that perform anomaly detection on numeric data from the data center are inevitably relying on global change point detection concepts. On the other hand, while log data promises a significantly different perspectives and dimensions to accomplish a similar task, state-of-the-art of solutions lack a capability to automatically detect significant change points in the log stream of an event source through learning its behavioral patterns. Such change points indicate the most important times when the source’s behavior significantly differs from the past. A second complementary approach to real-time change detection involves comparing a source’s current behavior with the current behavior of its peers in a population of sources serving a common role in the data center. Employing the concept of event types of log messages introduced earlier, we propose algorithms for each of these approaches that apply classical statistical and machine learning techniques to data capturing the distribution of those constructs. We demonstrate experimental results from our prototype algorithms.

Compression for Time Series Databases Using Independent and Principal Component Analysis

Conference Paper

Full-text available

Jul 2017

Reliable management of modern cloud computing infrastructures is unrealizable without monitoring and analysis of a huge number of system indicators (metrics) as time series data stored in big databases. Efficient storage and processing of collected historical data from all “objects” of those infrastructures are technology challenges for this Big Data application. We propose a data compression framework for databases of time series that applies correlation content of the data set. Specifically, the fundamental statistical concepts of independent component analysis (ICA) and principal component analysis (PCA) are employed to demonstrate the viability of the approach. We experimentally show significant compression rates for real data sets from IT systems.

Experiences in Building an Enterprise Data Analytics

Conference Paper

Full-text available

Oct 2016

The Information Age made data easily accessible and omnipresent, currently with features of big volume, high velocity, and large variety, never seen before. For sciences, that is an unbelievable opportunity to explain the world better. Moreover, in the post-Information Age, businesses make any attempt to collect data and deeply benefit from it to achieve highly innovative technologies in terms of automation, performance, and efficiency. We share our experiences in building an enterprise data analytics for managing modern cloud computing infrastructures, as well as make parallels with information theory problems.

Independents vs dependents.

Context in source publication

Similar publications

Citations