Identifying Impactful Service System Problems via Log Analysis

Shilin He∗†, The Chinese University of Hong Kong, Hong Kong, China, slhe@cse.cuhk.edu.hk
Qingwei Lin, Microsoft Research, Beijing, China, qlin@microsoft.com
Jian-Guang Lou, Microsoft Research, Beijing, China, jlou@microsoft.com
Hongyu Zhang, The University of Newcastle, NSW, Australia, hongyu.zhang@newcastle.edu.au
Michael R. Lyu∗, The Chinese University of Hong Kong, Hong Kong, China, lyu@cse.cuhk.edu.hk
Dongmei Zhang, Microsoft Research, Beijing, China, dongmeiz@microsoft.com

ABSTRACT

Logs are often used for troubleshooting in large-scale software systems. For a cloud-based online system that provides 24/7 service, a huge number of logs could be generated every day. However, these logs are highly imbalanced in general, because most logs indicate normal system operations, and only a small percentage of logs reveal impactful problems. Problems that lead to the decline of system KPIs (Key Performance Indicators) are impactful and should be fixed by engineers with a high priority. Furthermore, there are various types of system problems, which are hard to distinguish manually. In this paper, we propose Log3C, a novel clustering-based approach to promptly and precisely identify impactful system problems, by utilizing both log sequences (a sequence of log events) and system KPIs. More specifically, we design a novel cascading clustering algorithm, which can greatly reduce the clustering time while keeping high accuracy by iteratively sampling, clustering, and matching log sequences. We then identify the impactful problems by correlating the clusters of log sequences with system KPIs. Log3C is evaluated on real-world log data collected from an online service system at Microsoft, and the results confirm its effectiveness and efficiency. Furthermore, our approach has been successfully applied in industrial practice.
CCS CONCEPTS

• Software and its engineering → Software testing and debugging; Maintaining software;

KEYWORDS

Log Analysis, Problem Identification, Clustering, Service Systems

∗ Also with Shenzhen Research Institute, The Chinese University of Hong Kong.
† Work done mainly during internship at Microsoft Research Asia.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5573-5/18/11...$15.00
https://doi.org/10.1145/3236024.3236083

ACM Reference Format:
Shilin He, Qingwei Lin, Jian-Guang Lou, Hongyu Zhang, Michael R. Lyu, and Dongmei Zhang. 2018. Identifying Impactful Service System Problems via Log Analysis. In Proceedings of the 26th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE ’18), November 4–9, 2018, Lake Buena Vista, FL, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3236024.3236083

1 INTRODUCTION

For large-scale software systems, especially cloud-based online service systems such as Microsoft Azure, Amazon AWS, and Google Cloud, high service quality is vital. Since these systems provide services to hundreds of millions of users around the world, a small service problem could lead to great revenue loss and user dissatisfaction.
Large-scale software systems usually generate logs to record system runtime information (e.g., states and events). These logs are frequently used in the maintenance and diagnosis of systems. When a failure occurs, inspecting recorded logs has become a common practice. In particular, logs play a crucial role in the diagnosis of modern cloud-based online service systems, where conventional debugging tools are hard to apply.

Manual problem diagnosis is very time-consuming and error-prone due to the increasing scale and complexity of large-scale systems. Over the years, a stream of methods based on machine learning has been proposed for log-based problem identification and troubleshooting. Some use supervised methods, such as classification algorithms [43], to categorize system problems. However, they require a large number of labels and substantial manual labeling effort. Others use unsupervised methods, such as PCA [41] and Invariants Mining [23], to detect system anomalies. However, these approaches can only recognize whether there is a problem or not; they cannot distinguish among different types of problems.

To identify different problem types, clustering is the most pervasive method [7–9, 21]. However, it is hard to develop an effective and efficient log-based problem identification approach through clustering due to the following three challenges:

1) First, large-scale online service systems, such as those of Microsoft and Amazon, often run on a 7 × 24 basis and support hundreds of millions of users, which yields an incredibly large quantity of logs. For instance, a service system of Microsoft that we studied can produce dozens of terabytes of logs per day. Conducting conventional clustering on data of such magnitude notoriously consumes a great deal of time, which is unacceptable in practice [1, 12, 15, 18].

2) Second, there are many types of problems associated with the logs, and clustering alone cannot determine whether a cluster reflects a problem or not. In previous work on log clustering, developers are required to verify the problems manually during the clustering process [21], which is tedious and time-consuming.

3) Third, log data is highly imbalanced. In a production environment, a well-deployed online service system operates normally most of the time. That is, most of the logs record normal operations, and only a small percentage of logs are problematic and indicate impactful problems. The imbalanced data distribution can severely impede the accuracy of conventional clustering algorithms [42]. Furthermore, some problems inherently arise less frequently than others; these rare problems therefore emerge with fewer log messages. As a result, it is challenging to identify all problem types from the highly imbalanced log data.

To tackle the above challenges, we propose a novel problem identification framework, Log3C, using both log data and system KPI data. System KPIs (Key Performance Indicators such as service availability, average request latency, failure rate, etc.) are widely adopted in industry. They measure the health status of a system over a time period and are collected periodically.

To be specific, we propose a novel clustering algorithm, Cascading Clustering, which clusters a massive amount of log data by iteratively sampling, clustering, and matching log sequences (sequences of log events). Cascading clustering can significantly reduce the clustering time while keeping high accuracy. Further, we analyze the correlation between log clusters and system KPIs.
By integrating Cascading Clustering and Correlation analysis, Log3C can promptly and precisely identify impactful service problems.

We evaluate our approach on real-world log data collected from a deployed online service system at Microsoft. The results show that our method can accurately find impactful service problems in large log datasets with high time performance. Log3C identifies problems with an average precision of 0.877 and an average recall of 0.883. We have also successfully applied Log3C to the maintenance of many actual online service systems at Microsoft.

To summarize, our main contributions are threefold:

• We propose Cascading Clustering, a novel clustering algorithm that can greatly reduce the clustering time while keeping high accuracy. The implementation is available on GitHub¹.
• We propose Log3C, a novel framework that integrates cascading clustering and correlation analysis. Log3C can automatically identify impactful problems from a large amount of log and KPI data efficiently and accurately.
• We evaluate our method using real-world data from Microsoft. In addition, we have applied Log3C to the actual maintenance of online service systems at Microsoft. The results confirm the usefulness of Log3C in practice.

The rest of this paper is organized as follows: In Section 2, we introduce the background and motivation. Section 3 presents the proposed framework and each procedure in detail. The evaluation of our approach is described in Section 4.
Section 5 discusses the experiment results, and Section 6 shares some success stories and experiences obtained from industrial practice. The related work and conclusion are presented in Section 7 and Section 8, respectively.

¹ https://github.com/logpai/Log3C

2 BACKGROUND AND MOTIVATION

Cloud-based online service systems, such as Microsoft Azure, Google Cloud, and Amazon AWS, have been widely adopted in the industry. These systems provide a variety of services and support a myriad of users across the world every day. Therefore, a single system problem could cause catastrophic consequences. Thus far, service providers have made tremendous efforts to maintain high service quality. For example, Amazon AWS [2] and Microsoft Azure [25] claim to have "five nines", which indicates a service availability of 99.999%. Although a lot of effort has been devoted to quality assurance, in practice, online service systems still encounter many problems. To diagnose these problems, engineers often rely on system logs, which record system runtime information (e.g., states and events).

Figure 1: An Example of Log Messages and Log Events

Log messages (with task ID):
01 [t_41bx0] Name=Request (GET:http://AAA:1000/BBBB/sitedata.html)
02 [t_51xi4] Leaving Monitored Scope (EnsureListItemsData) Execution Time=52.9013
03 [t_23hl3] HTTP request URL: /14/Emails/MrX(MrX@mail.com)/1c-48f0-b29.eml
04 [t_41bx0] HTTP Request method: GET
05 [t_01mu1] HTTP request URL: /55/RST/UVX/ADEG/Lists/Files/docXX.doc
06 [t_41bx0] Overridden HTTP request method: GET
07 [t_41bx0] HTTP request URL: http://AAA:1000/BBBB/sitedata.html
08 [t_41bx0] Leaving Monitored Scope (Request (POST:http://AAA:100/BBBB/sitedata.html)) Execution Time=334.319268903038

Log events (after log parsing):
E1 Name=Request (*)
E2 Leaving Monitored Scope (*) Execution Time = *
E3 HTTP Request method: *
E4 HTTP request URL: *
E5 Overridden HTTP request method: *
The top frame of Figure 1 shows eight real-world log messages from Microsoft (some fields are omitted for simplicity of presentation). Each log message comprises two parts: a constant part and a variable part. The constant part consists of fixed text strings, which describe the semantic meaning of a program event. The variable part contains parameters (e.g., URL) that record important system attributes.

A log event is the abstraction of a group of similar log messages. As depicted in Figure 1, the log event for log messages 3, 5, and 7 is E4: "HTTP request URL: *", where the constant part is the common part of these log messages ("HTTP request URL:") and the asterisk represents the parameter part. Log parsing is the procedure that extracts log events from log messages; we defer details to Section 3.1.

A log sequence is a sequence of log events that record a system operation in the same task. In Figure 1, log messages 1, 4, 6, 7, and 8 are sequentially generated to record a typical HTTP request. These log messages share the same task ID (t_41bx0), and the corresponding log sequence is therefore [E1, E3, E5, E4, E2].

A well-deployed online service system operates normally in most cases and exhibits problems only occasionally. However, this does not imply that problems are easy to identify. On the contrary, problems are hidden among a vast number of logs, most of which record the system's normal operations. In addition, there are various types of service problems, which may manifest different patterns, occur at different frequencies, and affect the service system in different manners. As a result, it is challenging to precisely and promptly identify the service problems from the logs.

Figure 2: Long Tail Distribution of Log Sequences (x-axis: log sequence types, 1 to 18; y-axis: #Occurrence, logarithmic scale from 10^0 to 10^5)

As an example, Figure 2 shows the long tail distribution of 18 types of log sequences (in logarithmic scale for easy plotting), which are labeled by engineers from product teams. The first two types of log sequences account for more than 99.8% of the total occurrences (the "head") and are generated by normal system operations. The remaining ones indicate different problems, but together they account for less than 0.2% of all occurrences (the "long tail"). Moreover, the occurrences of distinct problem types vary significantly. For example, the first type of problem (the 3rd bar in Figure 2) is a "SQL connection problem", which shows that the server cannot connect to a SQL database. The most frequent problem occurs over 100 times more often than the least frequent one. The distribution is highly imbalanced and exhibits a strong long-tail property, which poses challenges for log-based problem identification.

Among all the problems, some are impactful because they can lead to the degradation of system KPIs. As mentioned above, system KPIs describe the system's health status. A lower KPI value indicates that some system problems may have occurred and the service quality has deteriorated. In our work, we leverage both log and KPI data to guide the identification of impactful problems. In practice, systems generate logs continuously, but KPI values are collected periodically. We use the term time interval to denote the KPI collection period. The time interval is typically 1 hour or more and is set by the production team.
In our setting, we use failure rate as the KPI, which is the ratio of failed requests to all requests within a time interval. In each time interval, there could be many logs but only one KPI value (e.g., one failure rate).

3 LOG3C: THE PROPOSED APPROACH

In this paper, we aim to solve the following problems: Given system logs and KPIs, how can impactful service system problems be detected automatically? How can different kinds of impactful service system problems be identified precisely and promptly?

To this end, we propose Log3C, whose overall framework is depicted in Figure 3. Log3C consists of four steps: log parsing, sequence vectorization, cascading clustering, and correlation analysis. In short, at each time interval, logs are parsed into log events and vectorized into sequence vectors, which are then grouped into multiple clusters through cascading clustering. However, clustering alone cannot tell us whether a cluster represents an impactful problem, which necessitates the use of KPIs. Consequently, in step four, we correlate clusters and KPIs over different time intervals to find impactful problems. More details are presented in the following sections.

3.1 Log Parsing

As mentioned above, log parsing extracts the log event for each raw log message, since raw log messages contain some superfluous information (e.g., file name, IP address) that can hinder automatic log analysis. The most straightforward way of log parsing is to write a regular expression for every logging statement in the source code, as adopted in [41]. However, this is tedious and time-consuming because the source code is updated very frequently and is not always available in practice (e.g., third-party libraries). Thus, automatic log parsing without source code is imperative.

In this paper, we use an automatic log parsing method proposed in [13] to extract log events. Following this method, some common parameter fields (e.g., IP address) are first removed using regular expressions.
Then, log messages are clustered into coarse-grained groups based on weighted edit distance. These groups are further split into fine-grained groups of log messages. Finally, a log event is obtained by finding the longest common substrings for each group of raw log messages.

To form a log sequence, log messages that share the same task ID are linked together and parsed into log events. Moreover, we remove the duplicate events in the log sequence. Generally, repetition indicates retrying operations or loops, such as continuously trying to connect to a remote server. Without removing duplicates, similar log sequences with different numbers of occurrences of the same event would be identified as distinct sequences, although they essentially indicate the same system behavior/operation. Removing duplicate log events also follows the common practice [21, 32] in log analysis.

3.2 Sequence Vectorization

After obtaining log sequences from logs in all time intervals, we compute the vector representation for each log sequence. We believe that different log events have different discriminative power in problem identification. As delineated in Step 2 of Figure 3, to measure the importance of each event, we calculate the event weight by combining the following two techniques:

IDF Weighting: IDF (Inverse Document Frequency) is widely used in text mining to measure the importance of words in documents; it lowers the weight of frequent words while increasing the weight of rare words [30, 31]. In our scenario, events that frequently appear in numerous log sequences cannot distinguish problems well because problems are relatively rare. Hence, the event and the log sequence are analogous to the word and the document, respectively. We aggregate log sequences in all time intervals together to calculate the IDF weight, which is defined in Equation 1, where N is the total number of log sequences and n_e is the number of log sequences that contain the event e.
With IDF weighting, frequent events have low weights, while rare events have high weights.

$$ w_{idf}(e) = \log \frac{N}{n_e} \tag{1} $$
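Equation 1 and the duplicate-removal step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names `dedupe` and `idf_weights` and the toy sequences are hypothetical, and the natural logarithm is an assumption because the text does not state the base.

```python
import math


def dedupe(sequence):
    """Remove repeated events, keeping the order of first occurrence."""
    return list(dict.fromkeys(sequence))


def idf_weights(sequences):
    """Return w_idf(e) = log(N / n_e) for every event e (natural log assumed)."""
    n_total = len(sequences)
    containing = {}  # event -> number of sequences that contain it
    for seq in sequences:
        for event in set(seq):
            containing[event] = containing.get(event, 0) + 1
    return {e: math.log(n_total / n_e) for e, n_e in containing.items()}


# Hypothetical log sequences pooled from all time intervals.
raw_sequences = [
    ["E1", "E2", "E4", "E5"],
    ["E2", "E3", "E3", "E4", "E5"],  # E3 repeated, e.g. a retry loop
    ["E1", "E2", "E3", "E5", "E4"],
    ["E3", "E4", "E6", "E5"],
]
sequences = [dedupe(s) for s in raw_sequences]
weights = idf_weights(sequences)

for event in sorted(weights):
    print(event, round(weights[event], 3))
```

In this toy input, E5 occurs in every sequence and therefore gets weight 0, while E6 occurs in a single sequence and gets the largest weight.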

Figure 3: Overall Framework of Log3C (panels: 1. Log Parsing, 2. Sequence Vectorization, 3. Cascading Clustering, 4. Correlation Analysis)

Importance Weighting: In problem identification, it is intuitive that events that strongly correlate with KPI degradation are more critical and should be weighted more. Therefore, we build a regression model between log events and KPI values to find the importance weight. To do so, as shown in Figure 3, in each time interval, we sum the occurrences of each event over all log sequences (three in the example) to form a summary sequence vector. This yields d summary sequence vectors, one per time interval, and d KPI values are also available, as mentioned above. Then, a multivariate linear regression model is applied to evaluate the correlation between log events and KPIs. The weights w_cor(e) obtained from the regression model serve as the importance weights for log events e. Note that the regression model only aims to find the importance weight for each log event.

As denoted in Equation 2, the final event weight is the weighted sum of the IDF weight and the importance weight:

$$ w(e) = \alpha \cdot \mathrm{Norm}(w_{idf}(e)) + (1 - \alpha) \cdot w_{cor}(e) \tag{2} $$

In addition, we use the Sigmoid function [40] to normalize the IDF weight into the range of [0, 1].
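The importance weighting and Equation 2 can be illustrated with the following sketch. It rests on several assumptions that the text does not confirm: the regression is fitted by ordinary least squares without an intercept, the summary vectors, KPI values, and IDF values are invented toy numbers, and the helper names (`least_squares`, `sigmoid`, `final_weight`) are hypothetical. The default α = 0.2 is the value reported for the experiments.

```python
import math


def least_squares(X, y):
    """Solve (X^T X) w = X^T y by Gauss-Jordan elimination (no intercept)."""
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(X, y))] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def final_weight(w_idf, w_cor, alpha=0.2):
    """Equation 2: alpha * Norm(w_idf) + (1 - alpha) * w_cor, Norm = Sigmoid."""
    return alpha * sigmoid(w_idf) + (1 - alpha) * w_cor


events = ["E1", "E2", "E3"]
# One summary sequence vector (event occurrence counts) per time interval.
summary_vectors = [[2, 3, 0], [1, 2, 2], [3, 1, 1], [0, 2, 3]]
kpi_values = [0.08, 0.25, 0.15, 0.34]  # toy failure rates, one per interval

w_cor = least_squares(summary_vectors, kpi_values)
w_idf = {"E1": 0.0, "E2": 0.69, "E3": 1.39}  # toy IDF weights
w = {e: final_weight(w_idf[e], c) for e, c in zip(events, w_cor)}

# Weighted sequence vector for a log sequence that contains E1 and E3 only.
presence = [1, 0, 1]
weighted_vector = [p * w[e] for p, e in zip(presence, events)]
print(w, weighted_vector)
```

Here E3 receives the largest regression coefficient because its counts track the toy failure rate most closely, so it dominates the weighted sequence vector.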
Since the importance weight is directly associated with KPIs and is thereby more effective in problem identification, we value the importance weight more, i.e., α < 0.5. In our experiments, we empirically set α to 0.2. Given the final event weights, the weighted sequence vectors can be easily obtained. For simplicity, hereafter, we use "sequence vectors" to refer to "weighted sequence vectors". Note that each log sequence has a corresponding sequence vector.

3.3 Cascading Clustering

Once all log sequences are vectorized, we group sequence vectors into clusters separately for each time interval. However, conventional clustering methods are extremely time-consuming when the data size is large [1, 12, 15, 18] because the distance between every pair of samples must be computed. As mentioned in Section 2, log sequences follow a long tail distribution and are highly imbalanced. Based on this observation, we propose a novel clustering algorithm, cascading clustering, to group sequence vectors into clusters promptly and precisely, where each cluster represents one kind of log sequence (system behavior).

Figure 4 depicts the procedure of cascading clustering, which leverages iterative processing, including sampling, clustering, matching, and cascading. The input of cascading clustering is all the sequence vectors in a time interval, and the output is a number of clusters. To be more specific, we first sample a portion of the sequence vectors, on which a conventional clustering method (e.g., hierarchical clustering) is applied to generate multiple clusters. Then, a pattern is extracted from each cluster. In the matching step, we match all the original unsampled sequence vectors against the patterns to determine their clusters. The unmatched sequence vectors are collected and fed into the next iteration. By iterating these processes, all sequence vectors can be clustered rapidly and accurately.
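The sample, cluster, match, and cascade loop described above can be sketched as follows. Several parts are assumptions for illustration only: the pattern is taken to be the element-wise mean of a cluster, a vector matches its nearest pattern when the distance is at most θ, and `leader_clustering` is a deliberately simple stand-in for the conventional clustering method (it is not hierarchical clustering). All function names are hypothetical.

```python
import math
import random


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def leader_clustering(vectors, theta):
    """Stand-in clusterer: join the first cluster whose first member is
    within theta, otherwise start a new cluster."""
    clusters = []
    for v in vectors:
        for cluster in clusters:
            if euclidean(v, cluster[0]) <= theta:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters


def cascading_clustering(vectors, p, theta, cluster_fn=leader_clustering, seed=0):
    rng = random.Random(seed)
    remaining, result = list(vectors), []
    while remaining:
        # Sampling: M = ceil(p * N) vectors chosen uniformly at random.
        m = math.ceil(p * len(remaining))
        chosen = set(rng.sample(range(len(remaining)), m))
        sample = [remaining[i] for i in chosen]
        rest = [v for i, v in enumerate(remaining) if i not in chosen]
        # Clustering and pattern extraction on the sample only.
        clusters = cluster_fn(sample, theta)
        patterns = [mean_vector(c) for c in clusters]
        # Matching: assign each unsampled vector to its nearest pattern,
        # or defer it to the next iteration (assumed rule: distance <= theta).
        unmatched = []
        for x in rest:
            dists = [euclidean(x, pat) for pat in patterns]
            j = min(range(len(patterns)), key=dists.__getitem__)
            if dists[j] <= theta:
                clusters[j].append(x)
            else:
                unmatched.append(x)
        result.extend(clusters)
        remaining = unmatched  # cascading
    return result


# Imbalanced toy data: one dominant behavior and one rare behavior.
group_a = [[0.01 * i, 0.0] for i in range(30)]
group_b = [[10.0, 10.0 + 0.01 * i] for i in range(5)]
clusters = cascading_clustering(group_a + group_b, p=0.1, theta=1.0)
print(sorted(len(c) for c in clusters))
```

Each iteration removes at least the sampled vectors from the remaining set, so the loop terminates; in the toy data, the dominant group is absorbed as soon as any of its members is sampled.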
Cascading clustering is fast because large clusters are separated from the remaining data in the first several iterations.

3.3.1 Sampling. Given numerous sequence vectors in each time interval, we first sample a portion of them through Simple Random Sampling (SRS). Each sequence vector has an equal probability p (e.g., 0.1%) of being selected. Suppose there are N sequence vectors in the input data; then the sampled data size is M = ⌈p ∗ N⌉. After sampling, log sequence types (clusters) that dominate in the original input data are still dominant in the sampled data.

3.3.2 Clustering. After sampling M sequence vectors from the input data, we group these sequence vectors into multiple clusters and extract a representative vector (pattern) from every cluster. To do so, we calculate the distance between every two sequence vectors and apply an ordinary clustering algorithm.

Distance Metric: During clustering, we use Euclidean distance as the distance metric, which is defined in Equation 3, where u and v are two sequence vectors and n is the vector length, i.e., the number of log events. u_i and v_i are the i-th values in vectors u and v, respectively.

$$ d(u, v) = \|u - v\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2} \tag{3} $$

$$ D(A, B) = \max \{ d(a, b), \ \forall a \in A, \ \forall b \in B \} \tag{4} $$

$$ \mu = \min \{ d(x, P_j), \ \forall j \in \{1, 2, \ldots, k\} \} \tag{5} $$

Clustering Technique: We use Hierarchical Agglomerative Clustering (HAC) to conduct clustering. At first, each sequence vector forms a cluster by itself, and the closest two clusters are merged into a new one. To find the closest clusters, we use complete linkage [38] to measure the cluster distance. As shown in Equation 4, D is the cluster distance between two clusters A and B, defined as the longest distance between any two elements (one in each cluster). The merging process continues until a distance threshold θ is reached. That is, the clustering stops when all the distances between clusters are larger than θ.
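The sampling and clustering steps can be sketched with the standard library alone. This is a naive illustration of complete-linkage HAC with a distance threshold θ under Equations 3 and 4, not an efficient or authoritative implementation; the function names and the toy vectors are hypothetical.

```python
import math
import random


def euclidean(u, v):
    """Equation 3: Euclidean distance between two sequence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(cluster_a, cluster_b):
    """Equation 4: the longest distance between elements of the two clusters."""
    return max(euclidean(a, b) for a in cluster_a for b in cluster_b)


def simple_random_sample(vectors, p, seed=0):
    """Draw M = ceil(p * N) vectors uniformly at random without replacement."""
    m = math.ceil(p * len(vectors))
    return random.Random(seed).sample(vectors, m)


def hac_complete(vectors, theta):
    """Merge the two closest clusters until every cluster distance exceeds theta."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > theta:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


vectors = [[0, 0], [0, 1], [5, 5], [5, 6], [20, 20]]
clusters = hac_complete(vectors, theta=2)
print(clusters)

sample = simple_random_sample([[float(i)] for i in range(10)], p=0.25)
print(len(sample))
```

With θ = 2, the two nearby pairs are merged and the distant vector stays in its own cluster. The pairwise search makes each merge quadratic in the number of clusters, which is one reason to run HAC only on the sampled vectors.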
In Section 4.4, we also study the effect of different values of the threshold θ. After clustering, similar sequence vectors are grouped into the same cluster, while dissimilar sequence vectors are separated into different clusters.

Pattern Extraction: After clustering, a representative vector is extracted for each cluster, which serves as the pattern of a group of similar log sequences. To do so, we compute the mean
