Fast and Scalable Knowledge Bundle Replication in Splunk ...

© 2019 SPLUNK INC.

Fast and Scalable Knowledge Bundle Replication in Splunk Enterprise 8.0

October 23rd 2019, The Venetian Sands Expo, Las Vegas

Aditya Dhoke, Software Engineer | Splunk

Anish Shrigondekar, Software Engineer | Splunk

During the course of this presentation, we may make forward‐looking statements regarding future events or plans of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results may differ materially. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, it may not contain current or accurate information. We do not assume any obligation to update any forward‐looking statements made herein.

In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release.

Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.

Forward-Looking Statements

Introduction
What are knowledge bundles and why do they matter?

Knowledge Bundles
Splunk Distributed Search 101

• Created on the search-head tier
• Composed of knowledge objects: lookups, data models, tags, alerts, …
• The search head sends bundles to its search peers
• A search peer runs searches on behalf of the search head using the relevant knowledge bundle

Classic Knowledge Bundle Replication
How does knowledge bundle replication work today?

Classic Bundle Replication
Search Head sends bundle to all Search Peers

[Diagram: one Search Head sends the bundle payload directly to each of 10 Indexers; num_threads = 4]

Problem Statement
Could we do better?

What you told us…
Voice of the customer

Replication could be slow in large deployments

WAN latency impacts performance

Search Head could potentially become a bottleneck

Cascading Knowledge Bundle Replication
Fast and scalable

Cascading Bundle Replication

Ultra-fast performance

Easy to configure and manage

Site and deployment aware

Cascading Bundle Replication
Search Head only sends bundle to some Search Peers

[Diagram: the Search Head sends the cascade plan to all 10 Indexers and the bundle payload to only some of them; those Indexers forward the payload to the remaining Indexers; num_threads = 4]

1. Search Head sends the cascade plan to all peers

2. Search Head sends the payload to its designated receivers

3. Search Peers send the payload to their designated receivers
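To get a feel for why forwarding through peers helps at scale, the following Python sketch compares the two approaches under a deliberately crude model. It assumes every transfer takes one equal "round", that a sender can serve `fanout` receivers per round, and that in cascading mode only the nodes reached in the previous round forward the payload. The function names and the use of num_threads as the fan-out are assumptions made for illustration, not a description of how Splunk schedules transfers.

```python
import math


def classic_rounds(num_peers: int, num_threads: int) -> int:
    """Rounds needed when the search head alone sends to every peer."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return math.ceil(num_peers / num_threads)


def cascading_rounds(num_peers: int, fanout: int) -> int:
    """Rounds needed when each newly reached peer forwards to `fanout` peers."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    rounds, reached, senders = 0, 0, 1  # the search head is the first sender
    while reached < num_peers:
        newly_reached = min(senders * fanout, num_peers - reached)
        reached += newly_reached
        senders = newly_reached  # only the newest receivers forward next
        rounds += 1
    return rounds


if __name__ == "__main__":
    for peers in (10, 120, 1000):
        print(peers, classic_rounds(peers, 4), cascading_rounds(peers, 4))
```

In this model the classic round count grows linearly with the number of peers, while the cascading round count grows roughly logarithmically. Real replication time also depends on bundle size, bandwidth, and site layout, none of which are modeled here.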

Terminology
Terms & Definition

Cascade Plan (Control Plane)
• Topology-aware execution plan generated on the search head
• Sent to all peers from the search head
• Stored in JSON format on all nodes

Cascade Payload (Data Plane)
• Actual payload packets sent by a sender to a set of local receivers based on the cascade plan
• A search peer can act as a sender as well as a receiver

Knowledge Bundle Replication Cycle
• Cycle triggered by the search head, composed of target peers and a bundle
• Composed of peers receiving the full bundle as well as peers receiving the delta bundle
• One cascade plan per bundle type

REST: /services/replication/cascading/plans

Cascading Plan Generation
Site and Topology Aware

Calculate Depth

Calculate Width

Group Classification

Peer Selection

Topology Assignment

[Diagram: logical overlay tree rooted at the Search Head. Site 1: Indexer A, Indexer B, Indexer C. Site 2: Indexer D, Indexer E, Indexer F.]

Depth = 2
Width = 2
Groups: Site 1 and Site 2
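The grouping and assignment steps can be pictured with a small Python sketch. It groups peers by site, picks one peer per site to receive the payload from the search head, and then fans the rest of the site out breadth-first with at most `width` receivers per sender. The function names, the peer selection rule (alphabetical order), and the dictionary layout of the plan are all assumptions made for illustration; the slides do not describe Splunk's actual algorithm or its JSON schema.

```python
import json
from collections import defaultdict, deque


def build_plan(peers: dict, width: int, root: str = "search_head") -> dict:
    """Map each sender to its receivers; `peers` maps peer name -> site."""
    if width < 1:
        raise ValueError("width must be at least 1")

    # Group classification: bucket peers by site.
    groups = defaultdict(list)
    for name, site in peers.items():
        groups[site].append(name)

    plan = defaultdict(list)
    for site in sorted(groups):
        members = deque(sorted(groups[site]))
        # Peer selection (assumed rule): first peer in each site hears from the root.
        head = members.popleft()
        plan[root].append(head)
        # Topology assignment: breadth-first fan-out inside the site.
        senders = deque([head])
        while members:
            sender = senders.popleft()
            for _ in range(min(width, len(members))):
                receiver = members.popleft()
                plan[sender].append(receiver)
                senders.append(receiver)
    return dict(plan)


def plan_depth(plan: dict, node: str = "search_head") -> int:
    """Longest path, in hops, from `node` down to a leaf."""
    return max((1 + plan_depth(plan, child) for child in plan.get(node, [])), default=0)


peers = {"A": "site1", "B": "site1", "C": "site1",
         "D": "site2", "E": "site2", "F": "site2"}
plan = build_plan(peers, width=2)
print(json.dumps(plan, indent=2))
print("depth =", plan_depth(plan))
```

For the six indexers in the diagram, this produces a tree of depth 2 in which the search head sends to A and D, A sends to B and C, and D sends to E and F. Note that in this sketch the root's fan-out equals the number of sites rather than `width`, which is a simplification.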

Fault Tolerance
What if things go wrong?

Cascading replication policy is built with a highly resilient, fault-tolerant design

State about all peers is maintained on the search head

Bundle state is communicated through a periodic heartbeat from the search head to the indexers

The search head is responsible for attempting retries

New bundle replication activity is blocked while an active replication cycle is in progress
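Two of the behaviors on this slide, retries driven by the search head and refusing a new cycle while one is still active, can be pictured with a short Python sketch. Everything in it (the `ReplicationCoordinator` class, the `send` callback, `max_attempts`, and the state strings) is hypothetical and only illustrates the idea; it is not Splunk's implementation.

```python
import threading


class ReplicationCoordinator:
    """Toy search-head-side coordinator that tracks per-peer bundle state."""

    def __init__(self, send, max_attempts: int = 3):
        self._send = send                # callable(peer, bundle_id) -> bool
        self._max_attempts = max_attempts
        self._cycle_lock = threading.Lock()
        self.peer_state = {}             # peer -> "ok" or "failed"

    def start_cycle(self, peers, bundle_id) -> bool:
        """Run one cycle; return False if another cycle is still active."""
        if not self._cycle_lock.acquire(blocking=False):
            return False
        try:
            for peer in peers:
                self.peer_state[peer] = self._replicate(peer, bundle_id)
        finally:
            self._cycle_lock.release()
        return True

    def _replicate(self, peer, bundle_id) -> str:
        # The coordinator, not the peer, decides when to retry.
        for _ in range(self._max_attempts):
            if self._send(peer, bundle_id):
                return "ok"
        return "failed"


if __name__ == "__main__":
    calls = []

    def always_ok(peer, bundle_id):
        calls.append((peer, bundle_id))
        return True

    demo = ReplicationCoordinator(always_ok)
    print(demo.start_cycle(["idx1", "idx2"], "bundle-1"), demo.peer_state)
```

A non-blocking lock is used so that a second caller gets an immediate refusal instead of queueing behind the active cycle.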

Fault Tolerance
What all could possibly go wrong? Like literally…

Topology Changes
• Peer addition
• Peer deletion
• Peer restarted

Bundle Failures

Bundle Stuck

Duplicate/Late Delivery

[Diagram: a Search Head (SH) and six Search Peers (SP)]

Performance Analysis
Let's come to the numbers…

• 61%: Average reduction in use of WAN bandwidth
• 67%: Average improvement in replication time
• 75%: Average reduction in CPU time on Search Head

Performance Analysis
Replication Time with Bundle Indexing

Bundle Size = 50MB
• 120 indexers: 79% faster
• 1000 indexers: 97% faster

Bundle Size = 1GB
• 120 indexers: 36% faster
• 1000 indexers: 85% faster

Performance Analysis
Replication Time Without Bundle Indexing

If bundle indexing (lookups) is excluded, cascading bundle replication performs even better.

For example, with Bundle Size = 1GB:
• Replication time with indexing in cascading mode: 37% faster
• Replication time without indexing in cascading mode: 70% faster

System Metrics
Resource Usage Analysis

Search Head Resource Usage

SH Metrics      Classic          Cascading        Comparison
CPU             300%~500%        100%             Cascading policy consumes lower CPU
Memory          300MB            300MB            Similar
IO Read         0                0                Similar
IO Write        200~2000         200~2000         Similar
Network Recv    0                0                Similar
Network Sent    40MBps~50MBps    10MBps           Cascading policy consumes lower network bandwidth

Search Peer Resource Usage

SP Metrics      Classic          Cascading        Comparison
CPU             100%             100%~400%        Cascading policy consumes more CPU
Memory          300MB~2000MB     300MB~2000MB     Similar
IO Read         0                0                Similar
IO Write        500~2500         500~3600         Similar
Network Recv    10MBps           10MBps           Similar
Network Sent    0                10MBps~30MBps    Cascading policy consumes higher network bandwidth

Configuration
Deploying in production

distsearch.conf
• Applicable on: Search Head
• Requires restart: Yes
• Preferred mode of deployment in Search Head Cluster: Deployer

[replicationSettings]
replicationPolicy = cascading

server.conf
• Applicable on: Search Peer
• Requires restart: Yes
• Preferred mode of deployment in Indexer Cluster: Cluster Master Bundle Push

[cascading_replication]
pass4SymmKey = <secret>

REST & CLI
Feature Visibility

Bundle Replication Configuration
• REST: /services/search/distributed/bundle/replication/config
• CLI: ./splunk show bundle-replication-config

Bundle Replication Status
• REST: /services/search/distributed/bundle/replication/cycles
• CLI: ./splunk show bundle-replication-status
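As a rough illustration of how the REST endpoints might be queried from a script, the Python sketch below only builds the request and does not send it. The endpoint paths come from the slide. The host name, the token, the use of management port 8089 (Splunk's default), the `output_mode=json` parameter, and bearer-token authentication are assumptions about the target deployment, and `build_request` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlunsplit
from urllib.request import Request

# Endpoint paths as listed on the slide.
ENDPOINTS = {
    "config": "/services/search/distributed/bundle/replication/config",
    "cycles": "/services/search/distributed/bundle/replication/cycles",
}


def build_request(host: str, name: str, token: str, port: int = 8089) -> Request:
    """Build (but do not send) a GET request for one of the endpoints."""
    query = urlencode({"output_mode": "json"})
    url = urlunsplit(("https", f"{host}:{port}", ENDPOINTS[name], query, ""))
    # Assumed auth scheme: a token passed as a bearer credential.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_request("sh1.example.com", "cycles", token="<token>")
print(req.get_method(), req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the call; certificate handling and error handling are left out of this sketch.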

Monitoring
How do I observe the feature?

New Dashboards in Splunk Distributed Monitoring Console (DMC)
• Configure in Distributed Mode
• Click on Search -> Knowledge Bundle Replication

New Metrics in metrics.log
• bundle_metadata
• cycle_dispatch
• peer_dispatch

Telemetry collected for better diagnosis and supportability

Key Takeaways
3 things you should take home with you…

1. Improved search experience with highly performant (avg 67% faster) and fault-tolerant knowledge bundle replication in cascading mode

2. Simple turn-key configuration with automatic site and deployment awareness

3. Finer-grained monitoring of and visibility into knowledge bundle replication than ever before

DEMO

Q&A

Thank You!
