Time-Series Anomaly Detection Service at Microsoft

Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou∗, Tony Xing, Mao Yang, Jie Tong, Qi Zhang
Microsoft, Beijing, China
{v-hanren,bix,yujwang,t-chyi,conhua,v-xiko,tonyxin,maoyang,jietong,qizhang}@microsoft.com

ABSTRACT
Large companies need to monitor various metrics (for example, Page Views and Revenue) of their applications and services in real time. At Microsoft, we develop a time-series anomaly detection service which helps customers to monitor the time-series continuously and alert for potential incidents on time. In this paper, we introduce the pipeline and algorithm of our anomaly detection service, which is designed to be accurate, efficient and general. The pipeline consists of three major modules, including data ingestion, experimentation platform and online compute. To tackle the problem of time-series anomaly detection, we propose a novel algorithm based on Spectral Residual (SR) and Convolutional Neural Network (CNN). Our work is the first attempt to borrow the SR model from the visual saliency detection domain for time-series anomaly detection. Moreover, we innovatively combine SR and CNN together to improve the performance of the SR model. Our approach achieves superior experimental results compared with state-of-the-art baselines on both public datasets and Microsoft production data.

CCS CONCEPTS
• Computing methodologies → Machine learning; Unsupervised learning; Anomaly detection; • Mathematics of computing → Time series analysis; • Information systems → Traffic analysis.

KEYWORDS
anomaly detection; time-series; Spectral Residual

ACM Reference Format:
Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou and Tony Xing, Mao Yang, Jie Tong, Qi Zhang. 2019. Time-Series Anomaly Detection Service at Microsoft. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330680

∗ Hansheng Ren is a student at the University of Chinese Academy of Sciences; Chao Yi and Xiaoyu Kou are students at Peking University. The work was done when they worked as full-time interns at Microsoft.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
KDD ’19, August 4–8, 2019, Anchorage, AK, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6201-6/19/08...$15.00
https://doi.org/10.1145/3292500.3330680

1 INTRODUCTION
Anomaly detection aims to discover unexpected events or rare items in data. It is popular in many industrial applications and is an important research area in data mining. Accurate anomaly detection can trigger prompt troubleshooting, help to avoid loss in revenue, and maintain the reputation and branding of a company. For this purpose, large companies have built their own anomaly detection services to monitor their business, product and service health [11, 20]. When anomalies are detected, alerts will be sent to the operators to make timely decisions related to incidents. For instance, Yahoo releases EGADS [11] to automatically monitor and raise alerts on millions of time-series of different Yahoo properties for various use-cases. At Microsoft, we build an anomaly detection service to monitor millions of metrics coming from Bing, Office and Azure, which enables engineers to move faster in solving live site issues.
In this paper, we focus on the pipeline and algorithm of our anomaly detection service specialized for time-series data. There are many challenges in designing an industrial service for time-series anomaly detection:

Challenge 1: Lack of Labels. To provide anomaly detection services for a single business scenario, the system must process millions of time-series simultaneously. There is no easy way for users to label each time-series manually. Moreover, the data distribution of time-series is constantly changing, which requires the system to recognize anomalies even though similar patterns have not appeared before. This makes supervised models insufficient in the industrial scenario.

Challenge 2: Generalization. Various kinds of time-series from different business scenarios need to be monitored. As shown in Figure 1, there are several typical categories of time-series patterns, and it is important for industrial anomaly detection services to work well on all of them. However, existing approaches do not generalize well enough across different patterns. For example, Holt-Winters [5] always shows poor results in (b) and (c), and Spot [19] always shows poor results in (a). Thus, we need to find a solution of better generality.

[Figure 1: Different types of time-series. (a) seasonal (b) stable (c) unstable]

arXiv:1906.03821v1 [cs.LG] 10 Jun 2019

Challenge 3: Efficiency. In business applications, a monitoring system must process millions, even billions, of time-series in near real time. Especially for minute-level time-series, the anomaly detection procedure needs to finish within a limited time. Therefore, efficiency is one of the major prerequisites for an online anomaly detection service. Even though models with large time complexity can be accurate, they are often of little use in an online scenario.

To tackle the aforementioned problems, our goal is to develop an anomaly detection approach which is accurate, efficient and general. Traditional statistical models [5, 14–17, 19, 20, 24] can be easily adopted online, but their accuracies are not sufficient for industrial applications. Supervised models [13, 18] are superior in accuracy, but they are insufficient in our scenario because of the lack of labeled data. There are other unsupervised approaches, for instance, Luminol [1] and DONUT [23]. However, these methods are either too time-consuming or parameter-sensitive. Therefore, we aim to develop a more competitive unsupervised method that favors accuracy, efficiency and generality simultaneously.

In this paper, we borrow the Spectral Residual model [10] from the visual saliency detection domain for our anomaly detection application. Spectral Residual (SR) is an efficient unsupervised algorithm, which demonstrates outstanding performance and robustness in visual saliency detection tasks. To the best of our knowledge, our work is the first attempt to borrow this idea for time-series anomaly detection. The motivation is that the time-series anomaly detection task is essentially similar to the problem of visual saliency detection. Saliency is what "stands out" in a photo or scene, enabling our eye-brain connection to quickly (and essentially unconsciously) focus on the most important regions.
Meanwhile, when anomalies appear in time-series curves, they are always the most salient part in vision.

Moreover, we propose a novel approach based on the combination of SR and CNN. CNN is a state-of-the-art method for supervised saliency detection when sufficient labeled data is available, while SR is a state-of-the-art approach in the unsupervised setting. Our innovation is to unite these two models by applying CNN directly on the output of SR. As the problem of anomaly discrimination becomes much easier on the output of the SR model, we can train CNN with automatically generated anomalies and achieve significant performance enhancement over the original SR model. Because the anomalies used for CNN training are fully synthetic, the SR-CNN approach remains unsupervised and establishes a new state-of-the-art performance when no manually labeled data is available.

As shown in the experiments, our proposed algorithm is more accurate and general than state-of-the-art unsupervised models. Furthermore, we also apply it as an additional feature in the supervised learning model. The experimental results demonstrate that the performance can be further improved when labeled data is available, and that the additional features do provide complementary information to existing anomaly detectors. Up to the date of paper submission, the $F_1$-scores of our unsupervised and supervised approaches are both the best ever achieved on the open datasets.

The contributions of this paper are highlighted as below:
• For the first time in the anomaly detection field, we borrow the technique of visual saliency detection to detect anomalies in time-series data. The inspiring results prove the possibility of using computer vision technologies to solve anomaly detection problems.
• We combine the SR and CNN models to improve the accuracy of time-series anomaly detection. The idea is innovative and the approach outperforms current state-of-the-art methods by a large margin. In particular, the $F_1$-score is improved by more than 20% on Microsoft production data.
• From the practical perspective, the proposed solution has good generality and efficiency. It can be easily integrated with online monitoring systems to provide quick alerts for important online metrics. This technique has enabled product teams to move faster in detecting issues, save manual efforts, and accelerate the process of diagnostics.

The rest of this paper is organized as follows. First, in Section 2, we describe the details of system design, including data ingestion, experimentation platform and online compute. Then, we share our experience of real applications in Section 3 and introduce the methodology in Section 4. Experimental results are analyzed in Section 5 and related works are presented in Section 6. Finally, we conclude our work and put forward future work in Section 7.

2 SYSTEM OVERVIEW
The whole system consists of three major components: data ingestion, experimentation platform and online compute. Before going into more detail about these components, we will introduce the whole pipeline first. Users can register monitoring tasks by ingesting time-series to the system. Ingesting time-series from different data sources (including Azure storage, databases and online streaming data) is supported. The ingestion worker is responsible for updating each time-series according to the designated granularity, for example, minute, hour, or day. Time-series points enter the streaming pipeline through Kafka and are stored into the time-series database. The anomaly detection processor calculates the anomaly status for incoming time-series points online. In a common scenario of monitoring business metrics, users ingest a collection of time-series simultaneously. As an example, the Bing team ingests the time-series representing the usage of different markets and platforms.
When an incident happens, the alert service combines anomalies of related time-series and sends them to users through emails and paging services. The combined anomalies show the overall status of an incident and help users shorten the time spent diagnosing issues. Figure 2 illustrates the general pipeline of the system.

[Figure 2: System Overview]

2.1 Data Ingestion
Users can register a monitor task by creating a Datafeed. Each datafeed is identified by Connect String and Granularity. Connect String is used to connect the user’s storage system to the anomaly detection service. Granularity indicates the update frequency of a datafeed; the minimum granularity is one minute. An ingestion task will ingest the data points of time-series to the system according to the given granularity. For example, if a user sets minute as the granularity, the ingestion module will create a task every minute to ingest a new data point. Time-series points are ingested into InfluxDB (https://www.influxdata.com/) and Kafka (https://kafka.apache.org/). Throughput of this module varies from 10,000 to 100,000 data points per second.

2.2 Online Compute
The online compute module processes each data point immediately after it enters the pipeline. To detect the anomaly status of an incoming point, a sliding window of the time-series data points is required. Therefore, we use Flink (https://flink.apache.org/) to manage the points in memory to optimize the computation efficiency. Currently, the streaming pipeline processes more than 4 million time-series every day in production. The maximum throughput can be 4 million every minute. The anomaly detection processor detects anomalies for each single time-series. In practice, a single anomaly is not enough for users to diagnose their service efficiently. Thus, the smart alert processor correlates the anomalies from different time-series and generates an incident report accordingly. As anomaly detection is the main topic in this paper, smart alert is not discussed in more detail.

2.3 Experimentation Platform
We build an experimentation platform to evaluate the performance of anomaly detection models. Before we deploy a new model, offline experiments and online A/B tests will be conducted on the platform. Users can mark a point as anomaly or not on the portal. A labeling service is provided to human editors. Editors will first label true anomaly points of a single time-series and then label false anomaly points from anomaly detection results of a specific model. Labeled data is used to evaluate the accuracy of the anomaly detection model. We also evaluate the efficiency and generality of each model on the platform. In online experiments, we flight several datafeeds to the new model.
A couple of metrics, such as click-through rate of alerts, percentage of anomalies and false anomaly rate, are used to decide whether the new model can be deployed to production. The experimentation platform is built on Azure machine learning service (https://azure.microsoft.com/en-us/services/machine-learning-service/). If a model is verified to be effective, the platform will expose it as a web service and host it on K8s (https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/).

3 APPLICATIONS
At Microsoft, it is a common need to monitor business metrics and act quickly to address the issue if there is anything outside of the normal pattern. To tackle the problem, we build a scalable system with the ability to monitor minute-level time-series from various data sources. Automated diagnostic insights are provided to assist users to resolve their issues efficiently. The service has been used by more than 200 product teams within Microsoft, across Office 365, Windows, Bing and Azure organizations, with more than 4 million time-series ingested and monitored continuously.

As an example, Michael from the Bing team would like to monitor the usage of their service in the global marketplace. In the anomaly detection system, he created a new datafeed to ingest thousands of time-series, each indicating the usage of a specific market (US, UK, etc.), device (PC, windows phone, etc.) or channel (PORE, QBRE, etc.). Within 5 minutes, Michael saw the ingested time-series on the portal. At 9am, Oct-14, 2017, the time-series associated with the UK market encountered an incident. Michael was notified through E-mail alerts (as shown in Figure 3(a)) and started to investigate the problem. He opened the incident report, where the top correlated time-series with anomalies are selected from a set of time-series around 9am. As shown in Figure 3(b), usage on PC devices and the PORE channel can be found in the incident report. Michael brought this insight to the team and finally found that the problem was caused by a relevance issue which made users issue many pagination requests (PORE) to get satisfactory search results.

[Figure 3: An illustration of example application from Microsoft Bing. (a) Alert Page (b) Incident Report]

As another example, the Outlook anti-spam team used to leverage a rule-based method to monitor the effectiveness of their spam detection system. However, this method was not easy to maintain and usually showed bad cases on some Geo-locations. Therefore, they ingested key metrics to our anomaly detection service to monitor the effectiveness of their spam detection model across different Geo-locations. Through our API, they have integrated anomaly detection ability into the Office DevOps platform. By using this automatic detection service, they have covered more Geo-locations and received fewer false positive cases compared to the original rule-based solution.

4 METHODOLOGY
The problem of time-series anomaly detection is defined as below.

Problem 1. Given a sequence of real values, i.e., $x = x_1, x_2, \ldots, x_n$, the task of time-series anomaly detection is to produce an output sequence, $y = y_1, y_2, \ldots, y_n$, where $y_i \in \{0, 1\}$ denotes whether $x_i$ is an anomaly point.

As emphasized in the Introduction, our challenge is to develop a general and efficient algorithm with no labeled data. Inspired by the domain of visual computing, we adopt Spectral Residual (SR) [10], a simple yet powerful approach based on Fast Fourier Transform (FFT) [21].
The SR approach is unsupervised and has been proved to be efficient and effective in visual saliency detection applications. We believe that the visual saliency detection and time-series anomaly detection tasks are essentially similar, because the anomaly points are usually salient from the visual perspective.

Furthermore, recent saliency detection research has favored end-to-end training with Convolutional Neural Networks (CNNs) when sufficient labeled data is available [25]. Nevertheless, it is prohibitive for our application as large-scale labeled data is difficult to collect online. As a trade-off, we propose a novel method, SR-CNN, which applies CNN directly on the output of the SR model. CNN is responsible for learning a discriminative rule to replace the single threshold adopted by the original SR solution. It is much easier to learn the CNN model on SR results than on the original input sequence. Specifically, we can use artificially generated anomaly labels to train the CNN-based discriminator. In the following sub-sections, we introduce the details of the SR and SR-CNN methods respectively.

4.1 SR (Spectral Residual)
The Spectral Residual (SR) algorithm consists of three major steps: (1) Fourier Transform to get the log amplitude spectrum; (2) calculation of spectral residual; and (3) Inverse Fourier Transform that transforms the sequence back to spatial domain. Mathematically, given a sequence $x$, we have

$$A(f) = \mathrm{Amplitude}(\mathfrak{F}(x)) \tag{1}$$
$$P(f) = \mathrm{Phase}(\mathfrak{F}(x)) \tag{2}$$
$$L(f) = \log(A(f)) \tag{3}$$
$$AL(f) = h_q(f) \cdot L(f) \tag{4}$$
$$R(f) = L(f) - AL(f) \tag{5}$$
$$S(x) = \mathfrak{F}^{-1}(\exp(R(f) + iP(f))) \tag{6}$$

where $\mathfrak{F}$ and $\mathfrak{F}^{-1}$ denote Fourier Transform and Inverse Fourier Transform respectively. $x$ is the input sequence with shape $n \times 1$; $A(f)$ is the amplitude spectrum of sequence $x$; $P(f)$ is the corresponding phase spectrum of sequence $x$; $L(f)$ is the log representation of $A(f)$; and $AL(f)$ is the average spectrum of $L(f)$ which can be approximated by convolving the input sequence by $h_q(f)$,
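
As a rough illustration of Equations (1) to (6), the following NumPy sketch computes a saliency map for a one-dimensional sequence. It is a minimal sketch under stated assumptions, not the production implementation: the definition of $h_q(f)$ is cut off in this excerpt, so the code assumes it is a length-q moving-average kernel; the window size q = 3, the small constant eps added before the logarithm, and the function name spectral_residual are illustrative choices; and the magnitude of the inverse transform is taken so that the saliency values are real.

```python
import numpy as np


def spectral_residual(x, q=3, eps=1e-8):
    """Saliency map of a 1-D sequence, following Eqs. (1)-(6).

    Assumptions (not taken from the text): h_q is a length-q
    moving-average kernel, eps guards against log(0), and the
    magnitude of the inverse transform is used as the saliency.
    """
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.fft(x)                      # Fourier Transform of x
    amplitude = np.abs(spectrum)                  # A(f), Eq. (1)
    phase = np.angle(spectrum)                    # P(f), Eq. (2)
    log_amplitude = np.log(amplitude + eps)       # L(f), Eq. (3)

    # AL(f), Eq. (4): local average of the log spectrum (assumed kernel).
    kernel = np.ones(q) / q
    avg_log_amplitude = np.convolve(log_amplitude, kernel, mode="same")

    residual = log_amplitude - avg_log_amplitude  # R(f), Eq. (5)

    # S(x), Eq. (6): back to the original domain, keeping the phase.
    return np.abs(np.fft.ifft(np.exp(residual + 1j * phase)))


# Usage: a clean sine wave with one injected spike.
t = np.arange(128)
series = np.sin(2 * np.pi * 4 * t / 128)
series[60] += 5.0
saliency = spectral_residual(series)
print(int(np.argmax(saliency)))  # index with the highest saliency
```

In this toy series the smooth periodic component is largely suppressed in the saliency map while the injected spike stands out, which is the property that makes a simple decision rule on top of the SR output plausible.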
