Chronix: Long Term Storage and Retrieval Technology for Anomaly Detection in Operational Data

Florian Lautenschlager (1), Michael Philippsen (2), Andreas Kumlehn (2), and Josef Adersberger (1)
(1) QAware GmbH, Munich, Germany
(2) University Erlangen-Nürnberg (FAU), Programming Systems Group, Erlangen

Abstract

Anomalies in the runtime behavior of software systems, especially in distributed systems, are inevitable, expensive, and hard to locate. To detect and correct such anomalies, one has to automatically collect, store, and analyze the operational data of the runtime behavior, often represented as time series. There are efficient means both to collect and to analyze the runtime behavior. But general-purpose time series databases do not focus on the specific needs of anomaly detection. Chronix is a domain specific time series database targeted at anomaly detection in operational data.

Detecting Anomalies in Running Software matters

Resource consumption: anomalous memory consumption, high CPU usage, ...
Sporadic failure: blocking state, deadlock, dirty read, ...
Security: port scanning activity, short frequent login attempts, ...

Economic or reputation loss.
Anomaly Detection Tool Chain for Operational Data

Collection Framework: collects operational data from a running application.
Time Series Database: stores the time series data.
Analysis Framework: asks the database for data and analyzes the data.

General Purpose TSDB: brake shoe, resource hog, productivity obstacle.
Chronix: domain specific TSDB with domain specific sensors and adaptors and domain specific analysis algorithms and tools.

Application's Operational Data

Types of operational data:
Metrics: scalar values, e.g., rates, runtimes, total hits, counters, ...
Events: single occurrences, e.g., a user's login, product order, ...
Traces: sequences within a software system, e.g., the called methods, ...

General-purpose TSDBs in Anomaly Detection

[Table: Graphite, InfluxDB, OpenTSDB, KairosDB, and Prometheus rated against three requirements: generic data model, analysis support, lossless long term storage.]

No support for data types = Productivity obstacle
No support for analyses = Productivity obstacle + Brake shoe
High memory footprint = Performance hog
High storage demands = Performance hog
Loss of historical data = Brake shoe

What makes Chronix domain specific?

Option to pre-compute an extra representation of the data
Optional timestamp compression for almost-periodic time series
Records that meet the needs of the domain
Compression technique that suits the domain's data
Underlying multi-dimensional storage
Domain specific query language with server-side evaluation
Domain specific commissioning of configuration parameters

How it works!
Example: Almost-periodic time series

Timestamp                 Value    Metric          Process    Host
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC

Optional Pre-compute Extras

Timestamp                 Value    Metric          Process    Host    SAX
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC   A
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC   B
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC   C
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC   B

[Figure: SAX symbol sequence B A C B C D C A B.]

Lossless storage that keeps all details as analyses may need them.
Programming interface to add extra domain specific "columns".
These "columns" speed up anomaly detection queries.

Optional Timestamp Compaction

Step                         Timestamp column
Original timestamps          25.10 … :01.546   25.10 … :06.718   25.10 … :11.891   25.10 … :16.964
Calculate deltas             25.10 … :01.546   5.172             5.173             5.073
Compute diffs between them   25.10 … :01.546   5.172             0.001             0.1
Drop diffs below threshold   25.10 … :01.546   5.172             -                 -
Result                       25.10 … :01.546   5.172             (space saved)     (space saved)

If the accumulated drift exceeds the threshold, store the delta.

Date-Delta-Compaction for almost-periodic time series.
Functionally lossless as all relevant details are kept.
Degree of inaccuracy is a configuration parameter (see Domain Specific Commissioning).

Domain Specific Records

Input lines after timestamp compaction:

Timestamp      Value    Metric          Process    Host    SAX
25….:01.546    218.34   ingester\time   SmartHub   QAMUC   A
5.172          218.37   ingester\time   SmartHub   QAMUC   B
-              218.49   ingester\time   SmartHub   QAMUC   C
-              218.35   ingester\time   SmartHub   QAMUC   B

Chunk & convert into one record:

Record
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data (BLOB):
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.35   B

Exploit repetitiveness and bundle "lines" into data chunks.
Programming interface for a specific time series record encoding.
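The Date-Delta-Compaction steps listed under Optional Timestamp Compaction can be sketched roughly as follows. This is a simplified illustration, not the Chronix implementation: the function names `ddc_compact` and `ddc_expand`, the use of integer milliseconds, and the choice to measure drift against the reconstructed timestamp are all assumptions made for the example.

```python
from typing import List, Optional, Tuple


def ddc_compact(timestamps_ms: List[int], threshold_ms: int) -> Tuple[int, List[Optional[int]]]:
    """Compact sorted timestamps into (first timestamp, deltas).

    A delta of None means "dropped": a reader reuses the last stored delta.
    Hypothetical helper, written only to illustrate the idea.
    """
    if not timestamps_ms:
        raise ValueError("need at least one timestamp")
    first = timestamps_ms[0]
    deltas: List[Optional[int]] = []
    last_stored: Optional[int] = None
    reconstructed = first  # timestamp a reader would rebuild so far
    for actual in timestamps_ms[1:]:
        if last_stored is None:
            # The first delta is always stored.
            last_stored = actual - reconstructed
            deltas.append(last_stored)
            reconstructed = actual
            continue
        predicted = reconstructed + last_stored
        drift = abs(actual - predicted)  # accumulated deviation so far
        if drift > threshold_ms:
            # Store a corrected delta so the drift is reset to zero.
            last_stored = actual - reconstructed
            deltas.append(last_stored)
            reconstructed = actual
        else:
            # Close enough: drop the delta and accept the small inaccuracy.
            deltas.append(None)
            reconstructed = predicted
    return first, deltas


def ddc_expand(first: int, deltas: List[Optional[int]]) -> List[int]:
    """Rebuild (approximate) timestamps from the compacted form."""
    out = [first]
    last = 0
    for delta in deltas:
        if delta is not None:
            last = delta
        out.append(out[-1] + last)
    return out


# Timestamps from the example above, as milliseconds after 00:00:00.
example = [1546, 6718, 11891, 16964]
compacted = ddc_compact(example, threshold_ms=200)
restored = ddc_expand(*compacted)
```

With a 200 ms threshold, `compacted` is `(1546, [5172, None, None])`, which mirrors the "5.172, -, -" row above, and every restored timestamp stays within the threshold of the original. A smaller threshold keeps more deltas and therefore saves less space.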
Chunk size is a configuration parameter (see Domain Specific Commissioning).

Domain Specific Compression

Record before compression:

Record
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data:
    Timestamp                 Value    SAX
    25.10.2016 00:00:01.546   218.34   A
    5.172                     218.37   B
    -                         218.49   C
    -                         218.35   B

Serialize & compress:

Record
  metric:  ingester\time
  process: SmartHub
  host:    QAMUC
  start:   25.10.2016 00:00:01.546
  end:     …
  type:    metric
  data (compressed BLOB): 00105e0 e6b0 343b 9 07bc 0804 e7d508040

Lossless compression techniques minimize the record size.
Domain data often has small increments, recurring patterns, etc.
Choice of compression technique is a configuration parameter (see Domain Specific Commissioning).

Multi-Dimensional Storage

Timestamp                 Value    Metric          Process    Host
25.10.2016 00:00:01.546   218.34   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:06.718   218.37   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:11.891   218.49   ingester\time   SmartHub   QAMUC
25.10.2016 00:00:16.964   218.35   ingester\time   SmartHub   QAMUC

q=host:QAMUC AND metric:ingester* AND type:[metric OR trace] AND end:NOW-7MONTH

Explorative: Users can use the attributes to find a record.
Correlating: Queries can use and combine all types.

Query Language & Server-Side Evaluation

Function        Graphite   InfluxDB   OpenTSDB   KairosDB   Prometheus   Chronix
Basic
distinct        no         yes        no         no         no           yes
integral        yes        no         no         no         no           yes
min/max/sum     yes        yes        yes        yes        yes          yes
bottom/top      no         yes        no         no         yes          yes
first/last      yes        yes        no         yes        no           yes
...             ...        ...        ...        ...        ...          ...
nnderivative    yes        yes        no         no         no           yes
movavg          yes        yes        no         no         no           yes
divide/scale    yes        yes        no         yes        yes          yes
High-level
sax [33]        no         no         no         no         no           yes
fastdtw [38]    no         no         no         no         no           yes
outlier         no         no         no         no         no           yes
trend           no         no         no         no         no           yes
frequency       no         no         no         no         no           yes
grpsize         no         no         no         no         no           yes
split           no         no         no         no         no           yes

[Figure: server-side evaluation pipeline with the stages query, read, process, result.]

q=metric:ingester* & cf=outlier

Basic functions & high-level built-in domain specific functions.
Plug-in interface to add functions for server-side evaluation.
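To give a feel for what a high-level function such as outlier might compute on the server before a result is returned, here is a minimal sketch. The poster does not say which criterion Chronix applies, so the interquartile-range rule, the factor 1.5, and the function name `find_outliers` are assumptions chosen purely for illustration.

```python
from statistics import quantiles
from typing import List, Sequence


def find_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values above Q3 + k * IQR (assumed outlier criterion)."""
    if len(values) < 2:
        return []  # quartiles are not meaningful for fewer than two points
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    upper_bound = q3 + k * (q3 - q1)
    return [v for v in values if v > upper_bound]


# A series with one obvious spike, and the same series without it.
with_spike = [1, 2, 3, 4, 5, 6, 7, 8, 100]
without_spike = [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Evaluating such a check next to the data means only the time series that contain outliers need to travel back to the client, rather than every raw data point.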
Domain Specific Commissioning (Projects 1–3)

[Figure: inaccuracy rate, average deviation, and space reduction (rates in %, average in ms) plotted against the DDC threshold in ms (0, 5, 10, 50, 100, 200, 1000).]
[Figure: total access time in sec plotted against the chunk size in kBytes (32, 64, 128, 256, 512, 1024) for gzip, LZ4, and Snappy.]

DDC threshold: 200 ms.
Compression & chunk size: gzip + 128 kByte.

Poster callouts: Easily detect patterns! Fast! Best values! Select the ideal compression! Remove jitter!

Evaluation

Benchmark Client: Ubuntu 16.04.1 x64, 12 cores, 32 GB RAM, 512 GB SSD; runs the Java benchmark.
Benchmark Server: Ubuntu 16.04.1 x64, 12 cores, 32 GB RAM, 512 GB SSD; runs InfluxDB, KairosDB, Chronix, and OpenTSDB in Docker.
Queries and time series are exchanged between client and server over HTTP.

Data of 5 Industry Projects.

Project              1       2       3       4      5         total
time series          1,080   8,567   4,538   500    24,055    38,740
pairs (mio) metric   2.4     331.4   162.6   3.9    3,762.3   4,262.6
pairs (mio) lsof     0.0     0.0     0.0     0.4    0.0       0.4
pairs (mio) strace   0.0     0.0     0.0     12.1   0.0       12.1

(a) Pairs and time series per project.

Project   r (days)   0.5   1    7    14   21   28   56   91   180   total
1–3       q          15    30   30   10   5    3    1    2    0     96
4 & 5     q          2     11   15   8    12   5    1    2    2     58
4 & 5     b          1     6    5    7    2    4    4    1    2     32
4 & 5     h          2     6    10   8    6    6    3    2    0     43

(b) Time ranges r (days) and occurrences of queries (q) for raw data retrieval, and of queries with basic (b) and high-level (h) functions.

Memory footprint (in MBytes)

               InfluxDB   OpenTSDB   KairosDB   Chronix
Initially      33         2,726      8,763      446
Import (max)   10,336     10,111     18,905     7,002
Query (max)    8,269      9,712      11,230     4,792

Chronix has a 34%–69% smaller memory footprint.

Storage demands (in GBytes)

Project   Raw Data   InfluxDB   OpenTSDB   KairosDB   Chronix
4         1.2        0.2        0.2        0.3        0.1
5         107.0      10.7       16.9       26.5       8.6
total     108.2      10.9       17.1       26.8       8.7

Chronix saves 20%–68% of the storage space.
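The 20%–68% range can be reproduced from the "total" row of the storage table above. The short calculation below is only a worked check of that arithmetic; the helper name `savings_percent` is made up for the example.

```python
# Total storage demands in GBytes, copied from the "total" row above.
storage_gb = {"InfluxDB": 10.9, "OpenTSDB": 17.1, "KairosDB": 26.8, "Chronix": 8.7}


def savings_percent(baseline: float, chronix: float) -> float:
    """Relative saving of Chronix compared with a baseline, in percent."""
    return 100.0 * (baseline - chronix) / baseline


savings = {
    name: round(savings_percent(size, storage_gb["Chronix"]))
    for name, size in storage_gb.items()
    if name != "Chronix"
}
```

This gives roughly 20% against InfluxDB, 49% against OpenTSDB, and 68% against KairosDB, so the quoted range spans the smallest and the largest of the three savings.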
Data retrieval times (in s)

r (days)   q    InfluxDB   OpenTSDB   KairosDB   Chronix
0.5        2    4.3        2.8        4.4        0.9
1          11   5.2        5.6        6.6        5.3
7          15   34.1       17.4       26.8       7.0
14         8    36.2       14.2       25.5       4.0
21         12   76.5       29.8       55.0       6.0
28         5    7.9        3.9        5.6        0.5
56         1    35.4       12.4       24.1       1.2
91         2    47.5       15.5       33.8       1.1
180        2    96.7       36.7       66.6       1.1
total           343.8      138.3      248.4      27.1

Chronix saves 80%–92% on data retrieval time.

Times for b- and h-queries (in s)

Queries   Function    InfluxDB   OpenTSDB   KairosDB   Chronix
Basic (b)
4         Avg.        0.9        6.1        9.8        4.4
5         Max.        1.3        8.4        9.1        6.0
3         Min.        0.7        2.7        5.3        2.8
3         Dev.        6.7        16.7       21.1       2.3
5         Sum         0.7        6.0        12.0       2.0
4         Count       0.8        5.5        10.5       1.0
8         Perc.       10.2       25.8       34.5       8.6
High-level (h)
12        Outlier     30.7       29.1       117.6      18.9
14        Trend       162.7      50.4       100.6      30.2
11        Freq.       47.3       23.9       45.7       16.3
3         GroupSize   218.9      2927.8     206.3      29.6
3         Split       123.1      2893.9     47.9       37.2
75        total       604.0      5996.3     620.4      159.3

Chronix saves 73%–97% on analysis times.

Poster callouts: Typical scenario. Important! Needed for exploration!

Evaluation: Projects 4 & 5

Conclusion

Chronix exploits the characteristics of the domain in many ways and thus achieves better storage and query results. Chronix is open source. www.chronix.io

Acknowledgements

This research was in part supported by the Bavarian Ministry of Economic Affairs and Media, Energy and Technology as an IuK-grant for the project DfD – Design for Diagnosability.

Chronix Poster for the Poster Session FAST 2017

Apr 11, 2017
