CONAN: Diagnosing Batch Failures for Cloud Systems

Liqun Li, Xu Zhang, Shilin He, Yu Kang, Hongyu Zhang, Minghua Ma, Yingnong Dang, Zhangwei Xu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Microsoft Research, Microsoft Azure, Microsoft 365, The University of Newcastle
(Qingwei Lin is the corresponding author of this work.)

Abstract—Failure diagnosis is critical to the maintenance of large-scale cloud systems and has attracted tremendous attention from academia and industry over the last decade. In this paper, we focus on diagnosing batch failures, which occur to a batch of instances of the same subject (e.g., API requests, VMs, or nodes), resulting in degraded service availability and performance. Manual investigation over a large volume of high-dimensional telemetry data (e.g., logs, traces, and metrics) is labor-intensive and time-consuming, like finding a needle in a haystack. Meanwhile, existing approaches are usually tailored for specific scenarios, which hinders their application to diverse scenarios. According to our experience with Azure and Microsoft 365 – two world-leading cloud systems – when batch failures happen, the procedure of finding the root cause can be abstracted as looking for contrast patterns by comparing two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. We thus propose CONAN, an efficient and flexible framework that can automatically extract contrast patterns from contextual data. CONAN has been successfully integrated into multiple diagnostic tools for various products, which proves its usefulness in diagnosing real-world batch failures.

I. INTRODUCTION

Failures of cloud systems (e.g., Azure [8], AWS [7], and GCP [6]) can notoriously disrupt online services and impair system availability, leading to revenue loss and user dissatisfaction. Hence, it is imperative to rapidly react to and promptly diagnose failures to identify the root cause after their occurrence [10], [54]. In this work, we focus on diagnosing batch failures, a common type of failure widely found in cloud systems. A batch failure is composed of many individual instances of a certain subject (e.g., API requests, computing nodes, VMs), typically within a short time frame. For example, a software bug could cause thousands of failed requests [42]. In cloud systems, batch failures tend to be severe and usually manifest as incidents [20], which cause disruption or performance degradation. Batch failures in cloud systems can be caused by software and configuration changes [32], [48], power outages [1], disk and network failures [43], [31], etc.

To diagnose batch failures, engineers often retrieve and carefully examine the contextual data of the instances, such as their properties (e.g., version or type), run-time information (e.g., logs or traces), and environment and dependencies (e.g., nodes or routers). Contextual information can often be expressed in the form of attribute-value pairs (AVPs), denoted as AttributeName="Value". Say a request emitted by App APP1 is served by a service of APIVersion V1 hosted on Node N1. Then, the contextual information of this specific request can be expressed with 3 AVPs, i.e., App="APP1", APIVersion="V1", and Node="N1". If a batch of failed API requests occurs due to an incompatibility between a specific service version (say APIVersion="V1") and a certain client application (say App="APP1"), then the combination {APIVersion="V1", App="APP1"} is what engineers aim to identify during failure diagnosis.
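To make this data model concrete, the sketch below shows one way to represent an instance's context and a candidate contrast pattern as sets of AVPs. It is our own illustration using the hypothetical names from the running example, not an excerpt of CONAN's code.

```python
# A minimal sketch of the AVP data model, assuming each instance's
# context is a frozenset of (attribute, value) tuples.
request_context = frozenset({
    ("App", "APP1"),
    ("APIVersion", "V1"),
    ("Node", "N1"),
})

# A contrast pattern is itself a set of AVPs; an instance "contains"
# the pattern when the pattern is a subset of its context.
pattern = frozenset({("App", "APP1"), ("APIVersion", "V1")})

def contains(instance_avps, pattern):
    """True if every AVP of the pattern appears in the instance."""
    return pattern <= instance_avps

assert contains(request_context, pattern)
```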
Identifying useful AVPs from the contextual data is often done case by case, based on expert knowledge, across diverse scenarios, according to our interviews with engineers. Fortunately, engineers usually follow a natural principle: compare two groups of instances, such as failed vs. succeeded, slow vs. normal, or during vs. before an anomaly. The objective is to look for a set of AVPs that can significantly differentiate the two groups of instances. We call such a set of AVPs a contrast pattern. However, manual examination of contrast patterns, which requires comparison over a large volume of high-dimensional contextual data, is labor-intensive and thus cannot scale up well. In one scenario (Sec. V-C), it takes up to 20 seconds to visualize only one data pivot, while there are hundreds of thousands of AVP combinations. Recently, many failure diagnosis methods have been proposed (summarized in Table I). These approaches have been shown to be useful in solving a variety of problems. However, they are rigid in adapting to new scenarios, so developers have to re-implement or even re-design them for each new scenario.

In this paper, we summarize common characteristics from diverse diagnosis scenarios, based on which we propose a unified data model to represent various types of contextual data. We propose a framework, namely CONAN, to automatically search for contrast patterns in the contextual data. CONAN first transforms diverse input data into a unified format, and then adopts a meta-heuristic search method [17] to extract contrast patterns efficiently. Finally, a consolidation process, which considers the concept hierarchy in the data, produces a concise diagnosis result.

The main advantage of CONAN over existing methods [51], [10], [44], [40] is that it can be applied flexibly to various scenarios. CONAN supports diagnosing both availability and performance issues, whereas each existing work summarized in Table I supports only one of them. In CONAN, the measure of how significantly a pattern differentiates the two instance groups is flexible. For example, CONAN can find patterns that are prevalent in abnormal instances but rare in normal ones, or patterns with a significantly higher latency only during the incident time. Moreover, CONAN supports multiple data types, including tabular telemetry data and console logs.

We have integrated CONAN into diagnostic tools used by Azure and Microsoft 365 – two world-leading cloud systems. In the last 12 months, CONAN has helped diagnose more than 50 incidents from 9 scenarios. Its advantages have been affirmed in real-world industrial practice, greatly saving engineers' time and stopping incidents from impacting more services and customers. To summarize, our main contributions are as follows:

- To the best of our knowledge, we are the first to unify the diagnosis problem in different batch failure scenarios into a generic problem of extracting contrast patterns.
- We propose an efficient and flexible diagnosis framework, CONAN, which can be applied to diverse scenarios.
- We integrate CONAN into a variety of typical real-world products and share our practices and insights in diagnosing cloud batch failures from 5 typical scenarios.

The rest of this paper is organized as follows. In Sec. II, we introduce the background of batch failures and demonstrate an industrial example. Sec. III presents the data model and problem formulation. The proposed framework CONAN and its implementation are described in Sec. IV. We show real-world applications in Sec. V. At last, we discuss CONAN in Sec. VI, summarize related work in Sec. VIII, and conclude the paper in Sec. IX.

II. BACKGROUND AND MOTIVATING EXAMPLE

For a cloud system, when a batch of instances (e.g., requests, VMs, nodes) fail, the monitoring infrastructure can detect them immediately, create an incident ticket, and alert on-call engineers to initiate the investigation. Although various automated tools (e.g., auto-collection of exceptional logs and traces) have been built to facilitate the diagnosis process, quickly diagnosing the failure remains the bottleneck, especially due to the excessive amount of telemetry data.

A. A Real-world Scenario: Safe Deployment

The Exchange service is a large-scale online service in Microsoft 365, responsible for message hosting and management. To reduce the impact caused by defective software changes, the service employs a safe deployment mechanism, similar to what is described in [5], [16], [32]. That is, a new build version needs to go through several release phases before reaching customers. Due to this progressive delivery process, multiple build versions co-exist in the deployment environment, as shown in Fig. 1, where nodes with different colors are deployed with different versions. Client applications interact via REST API requests. A batch of failed requests would trigger an incident.

[Fig. 1. Multiple build versions (V1, V2, V3) co-exist across Exchange clusters and AGs during the service deployment life-cycle, serving client applications APP1, APP2, and APP3. Nodes with different colors correspond to different build versions.]

The key question for safe deployment is: is the problem caused by a recent deployment, or by ambient noise such as a hardware failure or a network issue? To answer this question, a snapshot of contextual data during the incident is collected. For this large-scale online service, the data typically contains tens of millions of request instances. Each instance represents a failed or succeeded request.
A request has many associated attributes. To give a few examples, the APP attribute denotes the client application that invokes the API, and APIVersion denotes the server-side software build version of the invoked API. Other attributes, such as Cluster, Availability Group (AG), and Node, describe the location where requests are routed and served (one cluster consists of multiple AGs, and one AG is composed of 16 nodes). We aim to answer the aforementioned question based on the collected request data.

The idea behind the existing practice is intuitive: if the incident is caused by a build version, then that version's failure rate should be higher than the other versions'. For example, if APIVersion V3 has a significantly higher failure rate than V1 and V2, the production team suspects a deployment issue caused by V3. Engineers then deep-dive into the huge code base of the suspected version; if the initial direction given by the safe deployment approach is wrong, a lot of time can be wasted before they can confirm that there is no code regression. A minimal sketch of this per-version comparison is shown below.
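The following sketch illustrates the per-version failure-rate comparison described above. It is our own illustration, assuming the request snapshot is a pandas DataFrame with hypothetical Status and APIVersion columns; it is not the production team's actual tooling.

```python
import pandas as pd

# Hypothetical snapshot of request instances collected during the incident.
requests = pd.DataFrame({
    "APIVersion": ["V1", "V1", "V2", "V3", "V3", "V3"],
    "Status":     ["ok", "ok", "ok", "fail", "fail", "ok"],
})

# Failure rate per build version: the naive signal used to flag a
# suspicious deployment.
failure_rate = (
    requests.assign(failed=requests["Status"].eq("fail"))
    .groupby("APIVersion")["failed"]
    .mean()
    .sort_values(ascending=False)
)
print(failure_rate)  # V3 stands out with a 2/3 failure rate
```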
This approach sheds light on diagnosing batch failures by comparing the occurrence of an attribute (APIVersion in our example) between two categories of instances, i.e., failed and succeeded requests. However, it has two drawbacks. First, it considers only the APIVersion attribute while neglecting other attributes. A large number of failed requests emerging on a specific version might be caused by problems such as node or network failures, resulting in misattributing a non-deployment issue to a build version problem. Second, a batch failure could be caused by a combination of multiple attributes containing a certain version, as will be discussed in Sec. V-A. In that case, the failure rate of any single version might not be significantly higher than that of the other versions, so the build version problem is missed. Either way prolongs the diagnosis process. The production team could certainly enrich their method by adding heuristic rules to cover more cases, but that would be ad hoc and error-prone. In summary, we need a systematic solution to diagnose batch failures for safe deployment.

III. BATCH FAILURE DIAGNOSIS

In this section, we first summarize commonalities among the practices of diagnosing batch failures. We then abstract the data model and formulate failure diagnosis as a problem of identifying contrast patterns.

A. Commonalities in Scenarios

The first step in solving the batch failure diagnosis problem is to summarize the commonalities among diverse scenarios. However, the circumstances of the failures can appear so different that it is challenging to determine their commonalities. We have carefully analyzed the various cases encountered in practice, such as the safe deployment case introduced previously, and thoroughly reviewed existing research work. Table I presents a summary of literature studies on failure diagnosis.

TABLE I
COMMONALITIES IN DIFFERENT DIAGNOSIS WORK

| Failed instances | Contextual data | Contrast class | Diagnostic output |
|---|---|---|---|
| Requests [42], [12], [52] | Attributes and components on the critical paths or call graphs | Slow vs. normal requests, or failed vs. successful requests | Attribute combinations, e.g., {Cluster="PrdC01", API="GET"}, problematic components, or structural mutations |
| Software crash reports [41], [19], [38], [33], [47] | Attributes or traces, e.g., OS, modules, navigation logs, events, etc. | Crash vs. non-crash, or different types of crashes | Attribute combinations, e.g., {OS="Android", Event="upload"}, or event sequences or functions |
| OS events [49] | Driver function calls and status along the traces | Slow vs. normal event traces | Combinations of drivers and functions, e.g., {fv.sys!Func1, fs.sys!Func2, se.sys!Func3} |
| Virtual disks [51] | VM, storage account, and network topology | Failed vs. normal disks | Problematic component, e.g., a router or a storage cluster |
| Customer reports [35] | Attributes, e.g., country, feature, etc. | Reports during vs. before an anomaly | Attribute combinations, e.g., {Country="India", Feature="Edit"} |
| System operations [39], [14] | System logs or attributes, e.g., server, API, etc. | Slow vs. normal operations | Attribute combinations, e.g., {Latency > 568ms, Region="America"}, or a set of indicative logs |
| System KPIs [53], [25] | System logs | Logs during vs. before the KPI anomaly | A set of indicative logs |

These studies focus on notably different scenarios, but they share the following commonalities:

- They all concern failures that affect a batch of instances of the same subject (such as requests, software crashes, or disks), rather than a single instance or a handful of them.
- Each failure instance has a collection of attributes that describe the context in which it occurred.
- Besides the instances involved in each batch failure (called the target class), we can always find another set of instances with different statuses or performance levels (called the background class) for comparison.
- The diagnosis output, based on the instances and their attributes, can be represented as a combination of attribute-value pairs (AVPs).

In our motivating scenario, the background class consists of the requests that were successfully processed during the incident time. When the batch failure concerns a performance issue, such as high latency in request response time, we take the requests before the latency surge as the background class. By comparing the attributes of the two sets of instances, we can identify patterns, namely contrast patterns, which help narrow down the search for the root cause of the failure.
The commonalities in data characteristics and diagnosis process suggest that a generic approach could be effective in addressing the batch failure diagnosis problem. Since each scenario may look quite different, the framework must be flexible enough to benefit both current and future scenarios.

B. Contrast Pattern Identification

Contrast patterns demonstrate statistically significant differences between the two contrast classes. For instance, a pattern may be more prevalent in the target class than in the background class, or the instances matching a pattern may have a higher average latency in the target class than in the background class. Contrast pattern extraction can thus be formulated as a search problem: an objective function is defined first and then maximized during the subsequent search. We denote the objective function as $f_{B,T}(p)$, which quantitatively measures the difference of the pattern $p$ between the two classes $\{B, T\}$, where $B$ and $T$ stand for the background class and the target class, respectively. The goal of the search process is to find the pattern $\hat{p}$ that maximizes the objective function, i.e.,

$$\hat{p} = \arg\max_p f_{B,T}(p) \quad (1)$$

The objective function can vary across diagnosis scenarios, as will be demonstrated by the practical examples in Sec. V. For the safe deployment example in Sec. II-A, the objective function, shown in Eq. (2), is defined as the difference between the proportions of failed and succeeded requests that match a specific pattern $p$:

$$f_{B,T}(p) = \frac{|S_T(p)|}{|S_T|} - \frac{|S_B(p)|}{|S_B|} \quad (2)$$

where $|\cdot|$ denotes the number of instances, and $S_T(p)$ and $S_B(p)$ are the target-class and background-class instances, respectively, that contain the pattern $p$. The intuition is that the pattern $p$ should be more prevalent in the failed requests than in the succeeded requests.
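As a concrete illustration, Eq. (2) can be implemented directly over the instance representation sketched earlier. This is our own minimal sketch, not CONAN's internal code.

```python
# Minimal sketch of the proportion-difference objective of Eq. (2).
# Each instance is a frozenset of (attribute, value) tuples, as in the
# earlier sketch; `target` holds failed requests, `background` succeeded.
def objective(pattern, target, background):
    """Prevalence of `pattern` in the target class minus its
    prevalence in the background class."""
    def prevalence(instances):
        matched = sum(1 for inst in instances if pattern <= inst)
        return matched / len(instances) if instances else 0.0
    return prevalence(target) - prevalence(background)
```

A pattern that covers most failed requests but few succeeded ones scores close to 1, which is exactly what the search in Sec. IV-B optimizes toward.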
[Fig. 2. An overview of CONAN: data transformation, meta-heuristic search guided by the objective function f_{B,T}(p), and consolidation. The cloud-native deployment, backed by a NoSQL database, serves diagnosis reports (Pattern 1, Pattern 2, ...) and notifications to users via a REST API and a web portal.]

IV. THE PROPOSED SYSTEM

A generic framework for batch failure diagnosis must satisfy several requirements: it needs to support various input data, search for contrast patterns efficiently, and provide easy-to-interpret results. To fulfill these requirements, we propose the batch failure diagnosis framework CONAN. Fig. 2 shows an overview of our system, which consists of three components:

- Data transformation: We convert input data into our unified data model, i.e., instances. Each instance is represented by a set of AVPs, a class label, and optionally a metric value (e.g., latency).
- Meta-heuristic search: Following a user-specified objective function, contrast patterns are extracted by a meta-heuristic search algorithm.
- Consolidation: The search can produce duplicates, i.e., multiple patterns describing the same group of instances. We design rules to consolidate the final output patterns, making them concise and easy to interpret.

In most scenarios, the output patterns are consumed by engineers or system operators, who perform follow-up tasks to identify the actual root cause for triage, mitigation, and problem-fixing.

A. Data Transformation

In this section, we introduce the practice of transforming diverse diagnosis data into a unified format.

An instance is the atomic object on which our diagnosis framework operates. For example, in the safe deployment scenario, the diagnosis aims to find out why a batch of requests fail, so a request is treated as one instance. Similarly, as depicted in Table I, an instance could be a software crash report, an OS event, a virtual disk disconnection alert, a customer report, etc.

An attribute-value pair (AVP) is the data structure that denotes the contextual data for diagnosis. In practice, there can be a large number of attributes, so they are usually scoped based on engineering experience to avoid including unnecessary attributes or missing critical ones. For instance, for safe deployment we only care about attributes such as the client application, APIVersion, and Node, whose issues can directly cause a drop in the request success rate. As batch failures occur from time to time, the attribute set may be adjusted gradually: engineers start from a list of attributes based on their knowledge, and in subsequent practice new attributes are added or existing ones removed depending on the situation. The Contextual data column in Table I lists typical attributes chosen for various batch failure diagnosis tasks.

A contrast class, i.e., target class or background class, is assigned to each instance. The target class labels the instances of interest. In scenarios such as safe deployment, each instance has a status, namely success or failure, which naturally serves as the contrast class label. Sometimes only "failed" instances are collected, and the purpose of diagnosis is to find out why a sudden increase in failures occurs (e.g., an anomaly). In such scenarios, temporal information is used to decide the contrast class: we assign the target class to instances occurring during the anomaly and the background class to instances before the anomaly period. We shall present one such example in Sec. V-D. A sketch of the unified instance format follows.
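The unified data model can be pictured as follows. The type and field names are our own illustration of the paper's description (a set of AVPs, a class label, and an optional metric), not CONAN's published API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple

class ContrastClass(Enum):
    TARGET = "target"          # e.g., failed, slow, or during-anomaly
    BACKGROUND = "background"  # e.g., succeeded, normal, or pre-anomaly

@dataclass(frozen=True)
class Instance:
    """One diagnosis instance in the unified data model."""
    avps: FrozenSet[Tuple[str, str]]  # contextual attribute-value pairs
    label: ContrastClass              # contrast class assignment
    metric: Optional[float] = None    # optional metric, e.g., latency

# Status-based labeling (safe deployment) and time-based labeling
# (anomaly diagnosis) both reduce to constructing Instance objects:
inst = Instance(
    avps=frozenset({("App", "APP1"), ("APIVersion", "V1"), ("Node", "N1")}),
    label=ContrastClass.TARGET,
)
```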
Multiple attributes may form a hierarchical relationship. In the safe deployment example of Sec. II-A, Node is a low-level attribute relative to AG (Availability Group), because one AG contains multiple Nodes. Similarly, Node is a lower-level attribute than APIVersion, as one APIVersion is typically deployed to multiple nodes while one node has only one APIVersion. For convenience, we denote a hierarchical relationship as a chain of attributes, e.g., Node → AG → Cluster, which starts from a low-level attribute (left) and ends at a high-level attribute (right). The hierarchical chains can be mined automatically from the input data, because a high-level attribute and its low-level attributes form a one-to-many relationship, e.g., one AG corresponds to multiple nodes. CONAN provides a tool that analyzes the hierarchical relationships in the input data for users; the resulting hierarchy chains are used in subsequent steps. A sketch of this mining idea is given below.
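One simple way to mine such chains, assuming tabular input, is to test for many-to-one dependencies between attribute columns. This is our own sketch of the idea, not CONAN's shipped tool.

```python
from collections import defaultdict

def is_lower_level(rows, low, high):
    """True if each value of `low` maps to exactly one value of `high`
    (a many-to-one dependency), e.g., every Node lies in a single AG."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[low]].add(row[high])
    return all(len(highs) == 1 for highs in seen.values())

rows = [
    {"Node": "N01", "AG": "AG01", "Cluster": "C1"},
    {"Node": "N02", "AG": "AG01", "Cluster": "C1"},
    {"Node": "N03", "AG": "AG02", "Cluster": "C1"},
]
assert is_lower_level(rows, "Node", "AG")      # Node -> AG holds
assert not is_lower_level(rows, "AG", "Node")  # the reverse does not
```

Checking every ordered pair of attributes this way recovers chains such as Node → AG → Cluster.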
B. Meta-heuristic Search

A straightforward way to search for the desired patterns is to evaluate every possible pattern exhaustively with the objective function. This is clearly computationally inefficient: although we could explore all combinations of 1 or 2 AVPs by brute force, we cannot rigidly limit the pattern length in practice. Contrast pattern extraction therefore requires a more systematic search algorithm.

In this work, we adopt a meta-heuristic search framework [17], customized for mining contrast patterns. Compared to existing pattern mining [41], [19], [33] or model interpretation [51], [14] methods, heuristic search is easier to tailor to different scenarios. Meta-heuristic search is an effective combinatorial optimization approach. The input to the search framework is the data prepared as discussed in the previous section; the output is a list of contrast patterns optimized toward high objective function values. The search framework features the following components:

- A contrast pattern list, denoted as L_c.
- A current in-search pattern, denoted as p.
- A set of two search operations, ADD and DEL.
- An objective function f_{B,T}(p) that evaluates a pattern p.

[Fig. 3. Illustration of the search process, where each uppercase character represents an AVP. In each step, the current pattern p randomly selects an operation (ADD or DEL) to transition to a new pattern. Patterns along the search path with the highest objective function values are kept in the contrast pattern list L_c.]

The search process starts from an empty pattern (Null) and goes through many iterations to find patterns that optimize the objective function, as illustrated in Fig. 3.

1) The Search Process: In each iteration, we randomly apply an operation to the current pattern to generate a new pattern. ADD adds an AVP to the current pattern; DEL deletes one AVP from it. When adding an AVP, we skip AVPs whose attributes are already present in the current pattern. Once we have a new pattern, we evaluate its score with the objective function. We maintain the best contrast patterns found so far in a fixed-size list L_c, which is the algorithm's output. If L_c is not full, or the new score is higher than the minimum score stored in L_c, we add or update the pattern in L_c. The search is essentially finding the desired patterns w.r.t. the objective function, as defined in Eq. (2) for our example scenario.

The search ends when a static or dynamic exit criterion is satisfied. With a static criterion, we end the search after a predefined number of steps or a fixed time interval. With a dynamic criterion, the search stops if the pattern list L_c has not been updated for a certain number of steps. One advantage of meta-heuristic search is that it naturally supports early stopping: we can end the process early and still obtain reasonably good results, which is very helpful in diagnosis scenarios under hard time constraints. Besides, we can explicitly limit the pattern length during the search: once the current pattern reaches the maximum length, we disallow ADD operations.

2) The Scoring Function: When applying the ADD operation, simply choosing a random AVP is inefficient; instead, an AVP should be added only if it is likely to benefit the new pattern, e.g., by achieving a higher objective function score. In meta-heuristic search [55], the algorithm typically explores the whole neighborhood of the current state to find the next state that maximizes the objective function, which is not computationally affordable here due to the huge pattern space. As an approximation, we introduce a scoring function [26], [46] $f_s(\mathrm{AVP})$ for each AVP. Specifically, we inherit the objective function and calculate the score as

$$f_s(\mathrm{AVP}) = f_{B,T}(\{\mathrm{AVP}\}) \quad (3)$$

where {AVP} is a pattern with only one attribute-value pair. When we need to pick an AVP to add to the current pattern, we choose the one with the highest score. Although the pattern space is large due to combinatorial explosion, the number of AVPs is much smaller, so we can precompute a lookup table with the score of each AVP. In our implementation, we adopt the BMS (Best from Multiple Selections) mechanism [18], which selects the top few AVPs with the highest scores and then samples one of them with probability proportional to its score. The sketch below summarizes one search iteration with BMS-based ADD.
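Putting the pieces together, one iteration of the search might look like the following. This is a simplified sketch under the paper's description (random ADD/DEL, a precomputed AVP score table, BMS sampling, and a length cap); all helper names are our own, and the score table is assumed to cover every AVP.

```python
import random

def bms_pick(candidates, score_table, k=5):
    """Best-from-Multiple-Selections: take the top-k scored AVPs and
    sample one with probability proportional to its (positive) score."""
    top = sorted(candidates, key=score_table.get, reverse=True)[:k]
    weights = [max(score_table[avp], 1e-9) for avp in top]
    return random.choices(top, weights=weights, k=1)[0]

def search_step(pattern, all_avps, score_table, max_len=4):
    """Apply one random ADD or DEL operation to the current pattern,
    a frozenset of (attribute, value) tuples."""
    used_attrs = {attr for attr, _ in pattern}
    candidates = [avp for avp in all_avps if avp[0] not in used_attrs]
    can_add = len(pattern) < max_len and candidates
    if pattern and (not can_add or random.random() < 0.5):
        return pattern - {random.choice(sorted(pattern))}   # DEL
    return pattern | {bms_pick(candidates, score_table)}    # ADD
```

A driver loop would repeat search_step from the empty pattern, score each new pattern with the objective of Eq. (2), and keep the best patterns in the fixed-size list L_c until a static or dynamic exit criterion fires.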
Adding AVPs in a purely greedy fashion, i.e., always picking the AVP with the highest score, makes it easy for the search to get stuck in a local region of the search space. We therefore maintain a tabu list, inspired by the Tabu search algorithm [21], which tracks recently explored AVPs to avoid revisiting them cyclically. The size of the tabu list plays a key role in balancing exploration and exploitation during the search.

Recall that we mined the hierarchy chains in Sec. IV-A. When the current pattern contains an AVP of a low-level attribute (e.g., Node), it makes no sense to add an AVP of a higher-level attribute (e.g., AG or Cluster). We enforce this rule during the search to reduce the search space.

C. Consolidation

The output of the meta-heuristic search (Sec. IV-B) is a list of patterns sorted by their objective function values in descending order. We identify two situations that lead to pattern redundancy, which can confuse the user. They are shown in the following examples:

- p1: {App="APP1", APIVersion="V1"} vs. p2: {APIVersion="V1"}
- p1: {Node="N01"} vs. p2: {AG="AG01"} (assume Node "N01" resides in AG "AG01")

In each example, we have two patterns p1 and p2. Let S(p1) and S(p2) denote the sets of instances containing p1 and p2, respectively. In the first example, p2 is a subset of p1, given that each pattern is a set of AVPs; thus p1 describes a smaller set of instances than p2, i.e., S(p1) ⊆ S(p2). The same holds for the second example due to the hierarchy chain Node → AG → Cluster: patterns composed of low-level attributes are more specific than patterns with high-level attributes. In both situations, presenting both p1 and p2 to the user can cause confusion. Taking the second example, users may wonder whether this is an AG-scale issue or only a single-node issue. Once we identify such redundancy, we apply rules to deduplicate and retain only the major contributing pattern in the result list. Specifically, when two patterns achieve comparable objective function scores, we keep the one associated with the fewer instances, as it provides more fine-grained hints for localizing the root cause.

D. System Implementation

We implement CONAN as a Python library for easy reuse and customization. We design the objective scoring function