Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data

Florian Lautenschlager (1), Michael Philippsen (2), Andreas Kumlehn (2), and Josef Adersberger (1)
(1) QAware GmbH, Munich, Germany
(2) University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen

Abstract

Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to locate. To detect and correct such anomalies one has to automatically collect, store, and analyze the operational data of the runtime behavior, often represented as time series. There are efficient means both to collect and analyze the runtime behavior. But general-purpose time series databases do not focus on the specific needs of anomaly detection. Chronix is a domain specific time series database targeted at anomaly detection in operational data.

Detecting Anomalies in Running Software matters

• Resource consumption: anomalous memory consumption, high CPU usage, ...
• Sporadic failure: blocking state, deadlock, dirty read, ...
• Security: port scanning activity, short frequent login attempts, ...
→ Economic or reputation loss.
Anomaly Detection Tool Chain for Operational Data

• Collection Framework: collects operational data from a running application.
• Time Series Database: stores the time series data.
• Analysis Framework: asks the database for data and analyzes the data.

General Purpose TSDB
• Brake shoe
• Resource hog
• Productivity obstacle

Chronix: Domain specific TSDB
• Domain specific sensors and adaptors
• Domain specific analysis algorithms and tools

Application's Operational Data

Types of operational data:
• Metrics: scalar values, e.g., rates, runtimes, total hits, counters, …
• Events: single occurrences, e.g., a user's login, product order, …
• Traces: sequences within a software system, e.g., the called methods, …

General-purpose TSDBs in Anomaly Detection

Requirements                 Graphite   InfluxDB   OpenTSDB   KairosDB   Prometheus
Generic data model           ○          ◐          ○          ◐          ○
Analysis support             ◐          ◐          ○          ◐          ◐
Lossless long term storage   ○ ◐ (column positions not recoverable)

• No support for data types = Productivity obstacle
• No support for analyses = Productivity obstacle + Brake shoe
• High memory footprint = Performance hog
• High storage demands = Performance hog
• Loss of historical data = Brake shoe

What makes Chronix domain specific?

• Option to pre-compute an extra representation of the data
• Optional timestamp compression for almost-periodic time series
• Records that meet the needs of the domain
• Compression technique that suits the domain's data
• Underlying multi-dimensional storage
• Domain specific query language with server-side evaluation
• Domain specific commissioning of configuration parameters

How it works!
Example: Almost-periodic time series

Timestamp                 Value    Metric          Process    Host
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC
…                         …        …               …          …

Optional Pre-compute Extras

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC   B
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC   C
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC   B
…                         …        …               …          …       …

SAX representation (figure): B A C B C D C A B

• Lossless storage that keeps all details as analyses may need them.
• Programming interface to add extra domain specific "columns".
• These "columns" speed up anomaly detection queries.

Optional Timestamp Compaction

Step                            Timestamp column
Raw timestamps                  25.10 … :01.546   25.10 … :06.718   25.10 … :11.891   25.10 … :16.964   …
1. Calculate deltas             25.10 … :01.546   5.172             5.173             5.073             …
2. Compute diffs between them   25.10 … :01.546   5.172             0.001             0.1               …
3. Drop diffs below threshold   25.10 … :01.546   5.172             -                 -                 …
Result (space saved)            25.10 … :01.546   5.172             …

If accumulated drift > threshold: store delta.

• Date-Delta-Compaction for almost-periodic time series.
• Functionally lossless as all relevant details are kept.
• Degree of inaccuracy is a configuration parameter of Chronix.

Domain Specific Records

Chunk & convert (figure): the lines that share metric, process, and host become 1 record whose data field holds 1 BLOB.

Record
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data:
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.35   B

• Exploit repetitiveness and bundle "lines" into data chunks.
• Programming interface for a specific time series record encoding.
• Chunk size is a configuration parameter of Chronix.

Domain Specific Compression

Record before compression
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data:
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.35   B

Serialize & compress

Record after compression
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data:    00105e0 e6b0 343b 9 07bc 0804 e7d508040 (compressed BLOB)

• A lossless compression technique minimizes the record size.
• Domain data often has small increments, recurring patterns, etc.
• Choice of compression technique is a configuration parameter of Chronix.

Multi-Dimensional Storage

Timestamp                 Value    Metric          Process    Host
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC
…                         …        …               …          …

q=host:QAMUC AND metric:ingester* AND type:[metric OR trace] AND end:NOW-7MONTH

• Explorative: Users can use the attributes to find a record.
• Correlating: Queries can use and combine all types.

Query Language & Server-Side Evaluation

Function         Graphite   InfluxDB   OpenTSDB   KairosDB   Prometheus   Chronix
Basic
distinct         ×          ✓          ×          ×          ×            ✓
integral         ✓          ×          ×          ×          ×            ✓
min/max/sum      ✓          ✓          ✓          ✓          ✓            ✓
bottom/top       ×          ✓          ×          ×          ✓            ✓
first/last       ✓          ✓          ×          ✓          ×            ✓
...              ...        ...        ...        ...        ...          ...
nnderivative     ✓          ✓          ×          ×          ×            ✓
movavg           ✓          ✓          ×          ×          ×            ✓
divide/scale     ✓          ✓          ×          ✓          ✓            ✓
High-level
sax [33]         ×          ×          ×          ×          ×            ✓
fastdtw [38]     ×          ×          ×          ×          ×            ✓
outlier          ×          ×          ×          ×          ×            ✓
trend            ×          ×          ×          ×          ×            ✓
frequency        ×          ×          ×          ×          ×            ✓
grpsize          ×          ×          ×          ×          ×            ✓
split            ×          ×          ×          ×          ×            ✓

Server-side evaluation (figure labels): query, read, process, result.
Example: q=metric:ingester* & cf=outlier

• Basic functions & high-level built-in domain specific functions.
• Plug-in interface to add functions for server-side evaluation.
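The Date-Delta-Compaction steps from the timestamp compaction panel (calculate deltas, drop small differences, store a new delta once the accumulated drift exceeds the threshold) can be sketched as follows. This is a minimal illustration, not Chronix's actual implementation: the function names `ddc_compact` and `ddc_expand`, the use of integer milliseconds, and the exact drift rule are assumptions made for this sketch.

```python
from typing import List, Optional, Tuple


def ddc_compact(timestamps_ms: List[int],
                threshold_ms: int) -> Tuple[int, List[Optional[int]]]:
    """Hypothetical sketch: return the first timestamp plus one entry per
    later point, either an explicit delta or None (delta dropped)."""
    if not timestamps_ms:
        raise ValueError("need at least one timestamp")
    first = timestamps_ms[0]
    deltas: List[Optional[int]] = []
    reconstructed = first                  # what a reader would rebuild so far
    expected_delta: Optional[int] = None   # last explicitly stored delta
    for actual in timestamps_ms[1:]:
        if expected_delta is not None:
            # Accumulated drift if we simply reused the last stored delta.
            drift = abs(actual - (reconstructed + expected_delta))
            if drift <= threshold_ms:
                deltas.append(None)        # drop: the stored delta is close enough
                reconstructed += expected_delta
                continue
        # Drift too large (or no delta stored yet): store an explicit delta,
        # which also resets the drift to zero.
        expected_delta = actual - reconstructed
        deltas.append(expected_delta)
        reconstructed = actual
    return first, deltas


def ddc_expand(first: int, deltas: List[Optional[int]]) -> List[int]:
    """Rebuild timestamps; dropped entries reuse the last stored delta."""
    out = [first]
    expected_delta = 0
    for delta in deltas:
        if delta is not None:
            expected_delta = delta
        out.append(out[-1] + expected_delta)
    return out


# The four timestamps of the poster example, as milliseconds within the minute.
first, deltas = ddc_compact([1546, 6718, 11891, 16964], threshold_ms=200)
print(first, deltas)
print(ddc_expand(first, deltas))
```

With a threshold of 200 ms only the first timestamp and the first delta (5.172 s) are stored for the example series, and every reconstructed timestamp stays within the threshold of its original. A threshold of 0 keeps all deltas, which makes the compaction fully lossless.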
Domain Specific Commissioning (Projects 1–3)

Figure: Inaccuracy Rate, Average Deviation, and Space Reduction (rates in %, average in ms) over the DDC threshold in ms (0, 5, 10, 50, 100, 200, 1000).
Figure: Total access time in sec for gzip, LZ4, and Snappy over the chunk size in kBytes (32, 64, 128, 256, 512, 1024).

• DDC-Threshold: 200 ms.
• Compression & Chunk Size: gzip + 128 kByte.

Poster callouts: Easily detect pattern! Fast! Best Values! Select the ideal Compression! Remove Jitter!

Evaluation

Benchmark Client: Ubuntu 16.04.1 x64, 12 core, 32 GB RAM, 512 GB SSD; Java Benchmark.
Benchmark Server: Ubuntu 16.04.1 x64, 12 core, 32 GB RAM, 512 GB SSD; Docker with InfluxDB, KairosDB, Chronix, OpenTSDB.
The client sends queries and receives time series over HTTP.

Data of 5 Industry Projects.

(a) Pairs and time series per project.

Project              1       2       3       4      5         total
time series          1,080   8,567   4,538   500    24,055    38,740
pairs (mio) metric   2.4     331.4   162.6   3.9    3,762.3   4,262.6
pairs (mio) lsof     0.0     0.0     0.0     0.4    0.0       0.4
pairs (mio) strace   0.0     0.0     0.0     12.1   0.0       12.1

(b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval, and of queries with basic (b) and high-level (h) functions.

Project   r   0.5   1    7    14   21   28   56   91   180   total
1–3       q   15    30   30   10   5    3    1    2    0     96
4&5       q   2     11   15   8    12   5    1    2    2     58
4&5       b   1     6    5    7    2    4    4    1    2     32
4&5       h   2     6    10   8    6    6    3    2    0     43

Memory footprint (in MBytes)

               InfluxDB   OpenTSDB   KairosDB   Chronix
Initially      33         2,726      8,763      446
Import (max)   10,336     10,111     18,905     7,002
Query (max)    8,269      9,712      11,230     4,792

Chronix has a 34%–69% smaller memory footprint.

Storage demands (in GBytes)

Project   Raw Data   InfluxDB   OpenTSDB   KairosDB   Chronix
4         1.2        0.2        0.2        0.3        0.1
5         107.0      10.7       16.9       26.5       8.6
total     108.2      10.9       17.1       26.8       8.7

Chronix saves 20%–68% of the storage space.
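The 20%–68% range can be reproduced from the "total" row of the storage table. The short sketch below does the arithmetic, assuming that savings are measured as 1 − (Chronix size / other database size); the names `storage_gb` and `savings_percent` are illustrative only.

```python
# Totals from the storage demands table (in GBytes).
storage_gb = {"InfluxDB": 10.9, "OpenTSDB": 17.1, "KairosDB": 26.8, "Chronix": 8.7}


def savings_percent(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return (1.0 - candidate / baseline) * 100.0


savings = {
    name: savings_percent(size, storage_gb["Chronix"])
    for name, size in storage_gb.items()
    if name != "Chronix"
}

for name, pct in savings.items():
    print(f"Chronix vs {name}: {pct:.1f}% less storage")
```

The smallest saving is against InfluxDB (about 20%) and the largest against KairosDB (about 68%), which matches the range stated above.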
Data retrieval times (in s)

r       q    InfluxDB   OpenTSDB   KairosDB   Chronix
0.5     2    4.3        2.8        4.4        0.9
1       11   5.2        5.6        6.6        5.3
7       15   34.1       17.4       26.8       7.0
14      8    36.2       14.2       25.5       4.0
21      12   76.5       29.8       55.0       6.0
28      5    7.9        3.9        5.6        0.5
56      1    35.4       12.4       24.1       1.2
91      2    47.5       15.5       33.8       1.1
180     2    96.7       36.7       66.6       1.1
total        343.8      138.3      248.4      27.1

Chronix saves 80%–92% on data retrieval time.

Times for b- and h-queries (in s)

Queries   Function    InfluxDB   OpenTSDB   KairosDB   Chronix
Basic (b)
4         Avg.        0.9        6.1        9.8        4.4
5         Max.        1.3        8.4        9.1        6.0
3         Min.        0.7        2.7        5.3        2.8
3         Dev.        6.7        16.7       21.1       2.3
5         Sum         0.7        6.0        12.0       2.0
4         Count       0.8        5.5        10.5       1.0
8         Perc.       10.2       25.8       34.5       8.6
High-level (h)
12        Outlier     30.7       29.1       117.6      18.9
14        Trend       162.7      50.4       100.6      30.2
11        Freq.       47.3       23.9       45.7       16.3
3         GroupSize   218.9      2927.8     206.3      29.6
3         Split       123.1      2893.9     47.9       37.2
75        total       604.0      5996.3     620.4      159.3

Chronix saves 73%–97% on analysis times.

Poster callouts: Typical scenario. Important! Needed for exploration!

Evaluation: Projects 4&5

Conclusion

Chronix exploits the characteristics of the domain in many ways and thus achieves better storage and query results. Chronix is open source. www.chronix.io

Acknowledgements

This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology as an IuK-grant for the project DfD – Design for Diagnosability.