Figure 2 - uploaded by Ashot N. Harutyunyan
Independents vs dependents.

Source publication
Conference Paper
Full-text available
Today's IT management faces the problem of "virtualized big environments" with hundreds of thousands of objects/resources such as virtual machines, hosts, clusters, etc., evolving into cloud services. Admins of those infrastructures heavily rely on smart data-agnostic approaches to get reliable and accurate information regarding any current or upcomin...

Context in source publication

Context 1
... Figs. 1 and 2 illustrate the volumes of the categories we obtained for this environment. Fig. 1 shows the percentage of constants, dependents, and independents against all metrics, and Fig. 2 shows the percentages of dependents and independents against variable ...

Similar publications

Conference Paper
Full-text available
We present cCube, an open-source microservices architecture used to automatically create an application of one or more Evolutionary Machine Learning (EML) classification algorithms that can be deployed to the cloud, with automatic data factorization, training, result filtering, and fusion.
Article
Full-text available
An increasing number of technology enterprises are adopting cloud-native architectures to offer their web-based products, moving away from privately owned data centers and relying exclusively on cloud service providers. As a result, cloud vendors have lately increased in number, along with the estimated annual revenue they share. However, in the process o...
Conference Paper
Full-text available
Machine learning algorithms based on deep neural networks (NN) have achieved remarkable results and are being extensively used in different domains. On the other hand, with increasing growth of cloud services, several Machine Learning as a Service (MLaaS) are offered where training and deploying machine learning models are performed on cloud provid...
Article
Full-text available
Cloud computing provides support for consumers to reduce their internal infrastructure and for providers to increase revenues by utilizing their own infrastructure. Proper load balancing and dynamic resource provisioning improve cloud performance and attract cloud users. In this paper, we propose an automated resource provisioning algorithm...

Citations

... System administrators can no longer perform real-time decision-making due to the growth of large-scale distributed cloud environments with complicated, invisible underlying processes. Those systems require more advanced, ML/AI-empowered intelligent RCA with explainable and actionable recommendations (see [45][46][47][48][49][50][51][52][53] with references therein). ...
Preprint
Full-text available
Distributed tracing is a cutting-edge technology for monitoring, managing, and troubleshooting native cloud applications. It offers more comprehensive and continuous observability than traditional logging methods and is indispensable for navigating modern complex software architectures. However, the sheer volume of traces generated in distributed applications is staggering, and direct storage and utilization of every trace are impractical due to the associated operational costs. This entails a sampling strategy to select which traces warrant storage and analysis. Historically, sampling methods have included rate-based approaches, often relying heavily on manual configuration. There is a need for a more intelligent approach, and we propose a hierarchical sampling methodology to address multiple requirements concurrently. Initial rate-based sampling mitigates the overwhelming volume of traces, as no further analysis can be performed at this level. In the next stage, more nuanced analysis is facilitated on this foundation, incorporating information regarding trace properties and ensuring the preservation of vital process details even under extreme conditions. This comprehensive approach not only aids in the visualization and conceptualization of applications but also enables more targeted analysis in later stages. As we delve deeper into the sampling hierarchy, the technique becomes tailored to specific purposes, such as simplifying application troubleshooting. In this context, the sampling strategy prioritizes the retention of erroneous traces from dominant processes, thus facilitating the identification and resolution of underlying issues. The focus of this paper is to reveal the impact of sampling on troubleshooting efficiency. Leveraging intelligent and explainable artificial intelligence solutions enables the detection of malfunctioning microservices and provides transparent insights into root causes. We advocate for rule-induction systems, which offer explainability and efficacy in decision-making processes. By integrating advanced sampling techniques with machine-learning-driven intelligence, we empower organizations to navigate the complexities of large-scale distributed cloud environments effectively.
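
To make the hierarchical idea concrete, here is a minimal Python sketch of a two-stage sampler: a rate-based head-sampling stage followed by a property-aware stage that preferentially retains erroneous traces. The `Trace` record and all probabilities are illustrative assumptions, not details taken from the preprint.

```python
import random
from dataclasses import dataclass

@dataclass
class Trace:                 # hypothetical, simplified trace record
    trace_id: str
    service: str             # dominant process the trace belongs to
    has_error: bool          # whether any span in the trace failed

def head_sample(traces, rate=0.1, seed=42):
    """Stage 1: rate-based sampling to tame the raw volume."""
    rng = random.Random(seed)
    return [t for t in traces if rng.random() < rate]

def property_sample(traces, keep_error_prob=1.0, keep_ok_prob=0.05, seed=7):
    """Stage 2: property-aware sampling that prioritizes erroneous traces."""
    rng = random.Random(seed)
    kept = []
    for t in traces:
        p = keep_error_prob if t.has_error else keep_ok_prob
        if rng.random() < p:
            kept.append(t)
    return kept

# Usage: simulate 10k traces with a 2% error rate, then sample hierarchically.
traces = [Trace(str(i), "checkout", random.random() < 0.02) for i in range(10_000)]
stage1 = head_sample(traces, rate=0.2)
stage2 = property_sample(stage1)
print(len(stage1), len(stage2), sum(t.has_error for t in stage2))
```
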
... This communication relates to and builds upon our prior research [18][19][20][21][22][23][24][25] on specific tasks in cloud diagnostics and administration: time series forecasting; anomaly and change detection not only in such structured data but also in logs and traces; and event correlation analytics and abnormality root-cause inference from those information sources, toward comprehensive automation of self-driving data centers. ...
Preprint
Full-text available
Remediation of IT issues encoded into domain-specific or user-defined alerts, occurring in cloud environments and customer ecosystems, in the vast majority of cases suffers from a lack of accurate recommendations that could be supplied in time to recover from performance degradations. That is hard to realize by furnishing those abnormality definitions with appropriate expert knowledge, which varies from one environment to another. At the same time, in a large proportion of support cases, the problems reported under Global Support Services (GSS) or Site Reliability Engineering (SRE) treatment ultimately go down to the product teams, making them waste costly development hours on investigating self-monitoring metrics of our solutions. Therefore, the mean-time-to-resolution (MTTR) rates of problems/alerts are significantly impacted by the lack of a systematic approach towards adopting AIOps, which would imply building, maintaining, and continuously improving/annotating a data store of insights on which ML models are trained and generalized across the whole customer base and corporate cloud services. Our ongoing study is in line with such a vision and validates an approach that learns the alert resolution patterns in such a global setting and explains them using interpretable AI methodologies. The knowledge store of causative rules is then applied to predicting potential sources of the application degradation reflected in an active alert instance. In this communication, we share our experiences with a prototype solution and an up-to-date analysis demonstrating how root conditions are discovered with high accuracy for a specific type of problem. It is validated against historical data of resolutions performed through heavy manual development efforts. We also offer a Dempster-Shafer-theory-based rule verification framework for experts as a what-if analysis tool to test their hypotheses about the underlying environment.
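
The abstract above mentions a Dempster-Shafer-based rule verification framework. Below is a minimal sketch of Dempster's rule of combination, assuming two hypothetical "experts" assigning belief mass over candidate root causes of an alert; the frame of discernment and the mass values are invented for illustration.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination over frozenset-keyed mass functions."""
    combined, conflict = {}, 0.0
    for (a, p), (b, q) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + p * q
        else:
            conflict += p * q       # mass assigned to contradictory evidence
    if conflict >= 1.0:
        raise ValueError("total conflict: sources fully disagree")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two hypothetical experts assign belief mass to candidate root causes.
CPU, DISK = frozenset({"cpu"}), frozenset({"disk"})
EITHER = CPU | DISK                 # uncertainty between both causes
expert1 = {CPU: 0.6, EITHER: 0.4}
expert2 = {DISK: 0.3, EITHER: 0.7}
print(dempster_combine(expert1, expert2))
```
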
... Self-driving data centers require the availability of proactive analytics with AI for IT operations (AIOps) [1] in view of today's very large and distributed cloud environments. The key capabilities of AIOps are predictions, anomaly detection, correlations, and root cause analysis on all acquired data, including traces, logs, and time series (see [2][3][4][5][6][7][8][9][10] with references therein). ...
Article
Full-text available
The main purpose of application performance monitoring/management (APM) software is to ensure the highest availability, efficiency, and security of applications. APM software accomplishes these goals through automation, measurement, analysis, and diagnostics. Gartner specifies three crucial capabilities of APM software. The first is end-user experience monitoring for revealing the interactions of users with application and infrastructure components. The second is application discovery, diagnostics, and tracing. The third is machine learning (ML) and artificial intelligence (AI) powered data analytics for predictions, anomaly detection, event correlations, and root cause analysis. Time series metrics, logs, and traces are the three pillars of observability and a valuable source of information for IT operations. Accurate, scalable, and robust time series forecasting and anomaly detection are the requested capabilities of the analytics. Approaches based on neural networks (NN) and deep learning gain increasing popularity due to their flexibility and ability to tackle complex nonlinear problems. However, some of the disadvantages of NN-based models for distributed cloud applications mitigate expectations and require specific approaches. We demonstrate how NN models, pretrained on a global time series database, can be applied to customer-specific data using transfer learning. In general, NN models adequately operate only on stationary time series. Application to nonstationary time series requires multilayer data processing, including hypothesis testing for data categorization, category-specific transformations into stationary data, forecasting, and backward transformations. We present the mathematical background of this approach and discuss experimental results based on the implementation for Wavefront by VMware (an APM software) while monitoring real customer cloud environments.
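
A minimal sketch of the stationarity pipeline this abstract describes, assuming the ADF test for the hypothesis-testing step and plain differencing as the category-specific transformation; the sample-mean "model" only stands in for the pretrained NN models used in the paper.

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller  # ADF unit-root test

def forecast_nonstationary(y, horizon=10, alpha=0.05, max_d=2):
    """Sketch: hypothesis-test for stationarity, difference until stationary,
    forecast the stationary series, then transform back to the original scale."""
    z = np.asarray(y, dtype=float)
    tails = []                                   # last observation at each level
    while adfuller(z)[1] > alpha and len(tails) < max_d:
        tails.append(z[-1])                      # needed to undo differencing
        z = np.diff(z)
    # Stand-in model on the stationary series (the paper uses pretrained NNs):
    fc = np.full(horizon, z.mean())
    for last in reversed(tails):                 # backward transformations
        fc = last + np.cumsum(fc)
    return fc

# Usage: a random walk is nonstationary, so one differencing step is applied.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=500))
print(forecast_nonstationary(y, horizon=5))
```
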
... Self-driving data centers require the availability of proactive analytics with AI for IT operations (AIOps) [1] in view of today's very large and distributed cloud environments. The key capabilities of AIOps are predictions, anomaly detection, correlations, and root cause analysis on all acquired data, including traces, logs, and time series (see [2][3][4][5][6][7][8][9][10] with references therein). ... cloud environment can be performed without GPU acceleration and with a moderate number of CPU cores. ...
Preprint
Full-text available
One of the key components of application performance monitoring (APM) software is AI/ML-empowered data analytics for predictions, anomaly detection, event correlations and root cause analysis. Time series metrics, logs and traces are three pillars of observability and the valuable source of information for IT operations. Accurate, scalable and robust time series forecasting and anomaly detection are desirable capabilities of the analytics. Approaches based on neural networks (NN) and deep learning gain increasing popularity due to their flexibility and ability to tackle complex non-linear problems. However, some of the disadvantages of NN-based models for distributed cloud applications mitigate expectations and require specific approaches. We demonstrate how NN-models pretrained on a global time series database can be applied to customer specific data using transfer learning. In general, NN-models adequately operate only on stationary time series. Application to non-stationary time series requires multilayer data processing including hypothesis testing for data categorization, category specific transformations into stationary data, forecasting and backward transformations. We present the mathematical background of this approach and discuss experimental results from the productized implementation in Wavefront by VMware (an APM software) while monitoring real customer cloud environments.
... Data reduction in production analytics is an important technology challenge (see Poghosyan et al. [7]). Relevant ideas linking to the information bottleneck principle can be found in Harutyunyan et al. [8]. ...
Conference Paper
Full-text available
Cloud management solutions provide full real-time visibility into modern software-defined data centers (SDDC) of high complexity and sophistication by measuring millions of indicators with increasingly high sampling rates. This high-frequency monitoring of metrics captures the ever-growing dynamism of business-critical applications, resulting in huge bases of time series data to be stored for analysis, pattern detection, and training predictive/forecasting models. That causes high analytics overhead and product performance issues. Therefore, identifying optimal sampling rates of time series data, subject to preserving their main information content, could mitigate this issue. A particular use case is tuning the sampling rates to be efficient for training ML models that remain accurate enough in analytics tasks such as anomaly detection. In this paper, we analyze a large collection of cloud application metrics and show that the sampling rate can be substantially reduced with a small information divergence. Moreover, we show that the anomaly detection modules remain tolerably accurate on the reduced data sets.
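
To illustrate the information-divergence criterion, the sketch below downsamples a synthetic metric and measures the Jensen-Shannon distance between the original and reduced value distributions; the divergence measure, bin count, and signal are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def divergence_after_downsampling(x, factor, bins=50):
    """Compare the value distribution of a metric before and after
    reducing its sampling rate by `factor` (keep every factor-th point)."""
    lo, hi = np.min(x), np.max(x)
    p, _ = np.histogram(x, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(x[::factor], bins=bins, range=(lo, hi), density=True)
    return jensenshannon(p, q)  # 0 means identical distributions

# One day of a noisy hourly sine, sampled every 5 seconds.
rng = np.random.default_rng(1)
t = np.arange(86_400 // 5)
x = np.sin(2 * np.pi * t / 720) + 0.1 * rng.normal(size=t.size)
for factor in (2, 12, 60):
    print(factor, round(float(divergence_after_downsampling(x, factor)), 4))
```
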
... Based on the Dynamic and Hard Thresholding techniques ([3], [4]), employing sophisticated statistical inference methods on time series metrics measured from the entire data center, vR Ops is capable of detecting every atomic change/outlier (against the historically representative behavior of the monitoring flow) or anomaly occurring in the system, even when it does not primarily yield a malfunction. Extra sources of atomic anomalies can be different monitoring platforms, such as log management products (e.g., vR LI [2]; see the relevant ML approaches [5]-[7] developed in this area). ...
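
Dynamic Thresholding as productized in vR Ops is not public; the following is only a generic stand-in that flags points escaping a rolling quantile band learned from the metric's recent history.

```python
import numpy as np

def dynamic_threshold_outliers(x, window=500, q=0.999):
    """Generic sketch of dynamic thresholding: flag points falling outside a
    rolling quantile band learned from the recent history of the metric.
    A wider band (larger q) trades missed anomalies for fewer false alarms."""
    x = np.asarray(x, dtype=float)
    flags = np.zeros(x.size, dtype=bool)
    for i in range(window, x.size):
        hist = x[i - window:i]
        lo, hi = np.quantile(hist, 1 - q), np.quantile(hist, q)
        flags[i] = not (lo <= x[i] <= hi)
    return flags

# Usage: inject one spike; expect it flagged (plus occasional tail exceedances).
rng = np.random.default_rng(2)
x = rng.normal(size=2_000)
x[1_500] += 8.0
print(np.flatnonzero(dynamic_threshold_outliers(x)))
```
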
Conference Paper
Full-text available
Cloud management technologies increasingly automate different aspects of data center administration, where the final goal is to make self-driving solutions. Learning fingerprints of KPI- or SLO-impacting performance problems in IT infrastructures is a relevant task towards such a vision. Instead of defining problem types for data center components (resources/objects of various kinds) using domain knowledge, which is hard to obtain and unreliable because of the complexity and sophistication of modern cloud systems, we propose an ML framework to detect those issue categories. Alerting engines can then run on top of those patterns to notify users of conditions that are impacting the system's KPIs, thus providing explainability for troubleshooting and long-term performance optimization of the infrastructure. We consider several scenarios for learning problem definitions in terms of constructs by vRealize Operations, one of the leading solutions in the cloud management market. Using association rules mining concepts, we can recommend problem patterns (fingerprints) in the form of minimum-size attribute combinations that constitute core structures highly associated with degradation of the KPI or SLO loss. We demonstrate experimental insights on virtualized environments applying our prototype algorithm.
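
A toy rendition of the fingerprinting idea, assuming snapshots of discretized attribute states labeled with KPI degradation; minimum-size attribute combinations with high confidence toward the degraded label play the role of the recommended patterns. The data and thresholds are invented.

```python
from itertools import combinations

# Each snapshot: a set of discretized attribute states plus a KPI-degraded label.
snapshots = [
    ({"cpu=high", "mem=high", "io=low"}, True),
    ({"cpu=high", "mem=high", "io=high"}, True),
    ({"cpu=high", "mem=low", "io=low"}, False),
    ({"cpu=low", "mem=high", "io=low"}, False),
    ({"cpu=low", "mem=low", "io=high"}, False),
]

def fingerprints(data, max_size=2, min_conf=0.9, min_support=2):
    """Minimum-size attribute combinations highly associated with KPI loss."""
    items = sorted(set().union(*(s for s, _ in data)))
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            c = set(combo)
            if any(f < c for f, _ in found):   # a smaller fingerprint covers it
                continue
            hits = [bad for s, bad in data if c <= s]
            if len(hits) >= min_support and sum(hits) / len(hits) >= min_conf:
                found.append((c, sum(hits) / len(hits)))
    return found

# Expected: {'cpu=high', 'mem=high'} with confidence 1.0.
print(fingerprints(snapshots))
```
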
... Various approaches have been developed for anomaly and change detection using the history of monitoring data (both structured/metrics [8]-[9] and unstructured/logs [10]-[12]) to assist data center admins in faster RCA and troubleshooting. Prior art in this domain also focused on similarity analysis of data center incidents across their reoccurrences (see the PhD thesis by Bodik [13], references therein, including a direct ascendant work by Cohen et al. [14] on identifying crisis signatures). ...
Conference Paper
Full-text available
Identifying the actual root causes of a performance issue within a modern cloud infrastructure of high scale, sophistication, and complexity is a hard task. It is especially complicated to diagnose a service or infrastructure degradation of an unknown nature, when no active alert is indicative enough about the potential sources (be it an object, its metric, property, or an associated event) of the problem. In such a situation, the data center administration intuitively looks for changes in the system that might reveal the causative factors. This requires costly investigations and results in business-critical losses. Cloud management vendors are building visions around AIOps-enabled automation of the entire workflow of root cause analysis and troubleshooting. We propose a solution towards such a vision based on hypothesis testing and machine learning approaches for automatically mining "important changes" of various kinds in the behavior of data center objects across time and infrastructure topology. Those are the most relevant evidence patterns expected to explain the performance issue. Our current implementation, which is integrated into vRealize Operations, runs on the three available sorts of monitoring data: metrics, properties, and events. However, the full vision is to extensively include more observability provided by other cloud management tools vertically scaled to capture the depth of a specific dimension of data center administration. The implemented module produces lists of recommended patterns across those three dimensions, rank-ordered subject to different criteria for each, such as the confidence (p-value) provided by hypothesis testing and the magnitude of change in the metric data, an event's sentiment score or abnormality degree, the unexpectedness/entropy of property variations, etc. We describe the main analytical concepts behind the solution and demonstrate its validation in an application troubleshooting scenario.
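
As a sketch of how such "important changes" could be scored and rank-ordered, the code below applies a two-sample Mann-Whitney test around a suspected change time and ranks metrics by p-value and change magnitude; the test choice and the data are assumptions, not the productized method.

```python
import numpy as np
from scipy import stats

def score_change(metric, t_split):
    """Test whether a metric behaves differently before/after t_split and
    report (p-value, magnitude of change) for ranking evidence patterns."""
    before, after = metric[:t_split], metric[t_split:]
    _, p = stats.mannwhitneyu(before, after, alternative="two-sided")
    magnitude = abs(np.median(after) - np.median(before))
    return p, magnitude

# Synthetic metrics: one shifts its level at t=300, one does not.
rng = np.random.default_rng(3)
metrics = {
    "cpu.usage": np.r_[rng.normal(30, 2, 300), rng.normal(55, 2, 100)],
    "net.rx":    rng.normal(100, 10, 400),
}
ranked = sorted(metrics, key=lambda m: score_change(metrics[m], 300)[0])
for name in ranked:
    p, mag = score_change(metrics[name], 300)
    print(f"{name}: p={p:.2e}, |delta median|={mag:.1f}")
```
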
... 1. change point detection algorithms [6,7] that capture unusual behaviors in the quantified event trends data (or divergences in event types); 2. outlier detection algorithms (such as [13], based on extreme value theory and the maximum entropy principle) to detect spikes/abrupt changes in the quantified event trends data; 3. periodicity analysis (with Dynamic Thresholding [3]) to exclude spikes that might have a cyclical nature (e.g., nightly backups). ...
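
For the first item in that list, a plain CUSUM detector over quantified event counts is one minimal way to capture such trend shifts; this generic sketch is not the cited algorithms [6,7].

```python
import numpy as np

def cusum_changepoint(counts, threshold=5.0):
    """Plain CUSUM on standardized event counts: returns the first index where
    the cumulative drift from the series mean exceeds `threshold` sigma."""
    x = np.asarray(counts, dtype=float)
    z = (x - x.mean()) / (x.std() + 1e-9)
    s_pos = s_neg = 0.0
    for i, v in enumerate(z):
        s_pos = max(0.0, s_pos + v)   # accumulates upward drift
        s_neg = min(0.0, s_neg + v)   # accumulates downward drift
        if s_pos > threshold or -s_neg > threshold:
            return i
    return None

# Event counts with a trend shift at t=200; the detector fires shortly after.
rng = np.random.default_rng(4)
counts = np.r_[rng.poisson(20, 200), rng.poisson(35, 100)]
print(cusum_changepoint(counts))
```
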
Conference Paper
Full-text available
The identification of important changes in a complex distributed system is a challenging data science problem. Solving it is critical for tools managing modern cloud infrastructure stacks and other large, complex distributed systems. In this paper, we investigate two specific approaches to using log data to solve this problem. The first approach compares a source's current and past behavior. Solutions that perform anomaly detection on numeric data from the data center inevitably rely on global change point detection concepts. While log data promises significantly different perspectives and dimensions for accomplishing a similar task, state-of-the-art solutions lack the capability to automatically detect significant change points in the log stream of an event source by learning its behavioral patterns. Such change points indicate the most important times when the source's behavior significantly differs from the past. A second, complementary approach to real-time change detection compares a source's current behavior with the current behavior of its peers in a population of sources serving a common role in the data center. Employing the concept of event types of log messages introduced earlier, we propose algorithms for each of these approaches that apply classical statistical and machine learning techniques to data capturing the distribution of those constructs. We demonstrate experimental results from our prototype algorithms.
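
One hedged reading of the first approach: represent a source's past and current behavior as distributions over event types and measure their divergence. The event types and the Jensen-Shannon measure below are illustrative choices, not the paper's exact statistics.

```python
from collections import Counter
from scipy.spatial.distance import jensenshannon

def behavior_shift(past_events, current_events):
    """Divergence between a source's past and current event-type distributions."""
    types = sorted(set(past_events) | set(current_events))
    p_cnt, c_cnt = Counter(past_events), Counter(current_events)
    p = [p_cnt[t] for t in types]   # raw counts; jensenshannon normalizes them
    q = [c_cnt[t] for t in types]
    return jensenshannon(p, q, base=2)  # in [0, 1]; larger = bigger change

# A surge of io-error events signals a behavior change for this source.
past = ["login"] * 80 + ["io-error"] * 2 + ["gc"] * 18
now  = ["login"] * 40 + ["io-error"] * 45 + ["gc"] * 15
print(round(float(behavior_shift(past, now)), 3))
```
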
... Data science and machine learning approaches are applied to the collected data for behavioral pattern analysis and anomaly detection, problem root causing, and other relevant intelligent tasks such as predictive analytics in cloud systems. A package of such enterprise solutions and implementations is described in our earlier papers [2]-[5], dealing with both sources of data, structured and log, where, in particular, information-theoretic measures are applied in statistical inference (see also the notes by Vinck [6] on information theory and big data problems). ...
Conference Paper
Full-text available
Reliable management of modern cloud computing infrastructures is unrealizable without monitoring and analysis of a huge number of system indicators (metrics) as time series data stored in big databases. Efficient storage and processing of the historical data collected from all “objects” of those infrastructures are technology challenges for this big data application. We propose a data compression framework for databases of time series that exploits the correlation content of the data set. Specifically, the fundamental statistical concepts of independent component analysis (ICA) and principal component analysis (PCA) are employed to demonstrate the viability of the approach. We experimentally show significant compression rates for real data sets from IT systems.
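
The PCA half of the approach can be sketched in a few lines: store low-dimensional codes plus the principal components instead of the raw matrix and reconstruct on demand. The synthetic correlated data and the component count are assumptions; the paper additionally employs ICA.

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows = metrics (objects), columns = time points; correlated by construction.
rng = np.random.default_rng(5)
base = rng.normal(size=(5, 1_000))                  # 5 latent signals
mix = rng.normal(size=(200, 5))
X = mix @ base + 0.05 * rng.normal(size=(200, 1_000))

pca = PCA(n_components=5)
codes = pca.fit_transform(X)                        # compressed representation
X_hat = pca.inverse_transform(codes)                # reconstruction on demand

# Storage: codes + components + mean vector vs. the full raw matrix.
raw = X.size
stored = codes.size + pca.components_.size + pca.mean_.size
err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"compression ratio ~{raw / stored:.1f}x, relative error {err:.3f}")
```
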
... Building a generic data analytics platform to target such a goal in a context-independent way is a hard problem. Our experiences in providing real-time performance analytics for data centers are summarized in a system (see [3], [6,7], and [8]) of several modules encompassing (Fig. 1):
- behavioral analysis for time series data and extreme value analysis: typical vs. atypical behavior to judge anomalies based on data categorization, change point and periodicity detection;
- abnormality degree estimation for an outlying process to measure its severity or form an anomaly event;
- ranking of events in terms of their impact factor and problem root causing;
- principal feature analysis and event reduction;
- data compression;
- prediction of alterations in the system (allows sparing computational resources needed to run expensive behavioral pattern extraction procedures) and other building blocks. ...
... Concepts of information theory and its measures help in tackling the problems we are working on, such as pattern and anomaly detection in logs [4,5], extreme value analysis applying the maximum entropy principle [8], identification of problem root causes [3], and feedback-enhanced analytics [7]. ...
Conference Paper
Full-text available
The Information Age made data easily accessible and omnipresent, now with volumes, velocity, and variety never seen before. For the sciences, that is an unbelievable opportunity to explain the world better. Moreover, in the post-Information Age, businesses make every attempt to collect data and benefit deeply from it to achieve highly innovative technologies in terms of automation, performance, and efficiency. We share our experiences in building enterprise data analytics for managing modern cloud computing infrastructures, and we draw parallels with information theory problems.