© 2019 SPLUNK INC. Fast and Scalable Knowledge Bundle Replication in Splunk Enterprise 8.0. October 23rd, 2019, The Venetian Sands Expo, Las Vegas
Aditya Dhoke, Software Engineer | Splunk
Anish Shrigondekar, Software Engineer | Splunk
Forward-Looking Statements
During the course of this presentation, we may make forward-looking statements regarding future events or plans of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results may differ materially. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, it may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements made herein.
In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionalities described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Turn Data Into Doing, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.
Splunk Distributed Search 101
Knowledge Bundles
Created on the search-head tier
Composed of knowledge objects
• lookups
• data models
• tags
• alerts
• …
The search head sends bundles to its search peers
A search peer runs searches on behalf of the search head using the relevant knowledge bundle
Classic Knowledge Bundle Replication
How does knowledge bundle replication work today?
Classic Bundle Replication
Search Head sends bundle to all Search Peers
[Diagram: one Search Head sends the bundle payload directly to each of 10 Indexers; num_threads = 4]
What you told us…
Voice of the customer
Replication could be slow in large deployments
WAN latency impacts performance
Search Head could potentially become a bottleneck
Cascading Bundle Replication
Ultra-fast performance
Easy to configure and manage
Site and deployment aware
Cascading Bundle Replication
Search Head only sends bundle to some Search Peers
[Diagram: a Search Head and 10 Indexers; the plan is sent to every Indexer, while the bundle payload is relayed onward by the Indexers; num_threads = 4]
1. Search Head sends cascade plan to all peers
2. Search Head sends payload to designated receivers
3. Search Peers send payload to designated receivers
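The difference between the two policies can be illustrated with a deliberately simplified round-counting model. This is a sketch under stated assumptions, not a description of Splunk's scheduler: it assumes every transfer takes one "round", that the classic search head serves num_threads peers per round, and that in cascading mode every receiver forwards to a fixed number of new receivers (the hypothetical parameter `width`). Site awareness and bundle size are ignored, and the function names are illustrative.

```python
import math


def classic_rounds(num_peers, num_threads=4):
    """Rounds needed when only the search head sends, num_threads peers at a time."""
    return math.ceil(num_peers / num_threads)


def cascading_rounds(num_peers, width=4):
    """Rounds needed when each receiver forwards to `width` new receivers.

    Level 1 holds `width` peers (fed by the search head), level 2 holds
    width**2 peers, and so on. One round fills one level of the tree.
    """
    delivered = 0
    level_size = 1
    rounds = 0
    while delivered < num_peers:
        level_size *= width      # each sender in the previous level feeds `width` peers
        delivered += level_size
        rounds += 1
    return rounds


for n in (10, 120, 1000):
    print(n, classic_rounds(n), cascading_rounds(n))
```

In this toy model the classic policy grows linearly with the number of peers, while the cascading policy grows with the depth of the tree. Real replication times depend on bundle size, network conditions, and indexing work, so the model only shows the shape of the scaling.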
Terminology
Terms & Definitions
Cascade Plan (Control Plane)
• Topology-aware execution plan generated on the search head
• Sent to all peers from the search head
• Stored in JSON format on all nodes
Cascade Payload (Data Plane)
• Actual payload packets sent by a sender to a set of local receivers, based on the cascade plan
• A search peer can act as a sender as well as a receiver
Knowledge Bundle Replication Cycle
• Cycle triggered by the search head, composed of target peers and a bundle
• Includes peers receiving the full bundle as well as peers receiving the delta bundle
• One cascade plan per bundle type
REST: /services/replication/cascading/plans
Cascading Plan Generation
Site and Topology Aware
Calculate Depth
Calculate Width
Group Classification
Peer Selection
Topology Assignment
[Diagram: logical overlay tree rooted at the Search Head. Site 1: Indexer A, Indexer B, Indexer C. Site 2: Indexer D, Indexer E, Indexer F.]
Depth = 2, Width = 2, Groups: Site 1 and Site 2
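One way to picture the logical overlay tree from the diagram is the sketch below. It is an illustration only: the slides do not describe how Splunk selects peers or assigns the topology, so the rule used here (the first peer listed for each site receives from the search head, and each peer forwards to at most `width` peers in its own site) is an assumption, as are the names `build_overlay`, `depth`, and `"search_head"`.

```python
from collections import defaultdict


def build_overlay(peers_by_site, width):
    """Return {sender: [receivers]} with one width-ary tree per site.

    Assumption for illustration: peers are placed in list order, and the
    first peer of every site is fed directly by the search head.
    """
    plan = defaultdict(list)
    for peers in peers_by_site.values():
        if not peers:
            continue
        plan["search_head"].append(peers[0])
        for i, peer in enumerate(peers):
            # Children of position i in a width-ary tree laid out in list order
            start = i * width + 1
            children = peers[start:start + width]
            if children:
                plan[peer].extend(children)
    return dict(plan)


def depth(plan, node="search_head"):
    """Number of hops from `node` to its farthest receiver."""
    receivers = plan.get(node, [])
    if not receivers:
        return 0
    return 1 + max(depth(plan, r) for r in receivers)


sites = {"site1": ["A", "B", "C"], "site2": ["D", "E", "F"]}
plan = build_overlay(sites, width=2)
print(plan)
print(depth(plan))
```

With the six indexers from the diagram and a width of 2, this sketch yields a tree of depth 2 in which the payload crosses to each site once and is then relayed within the site.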
Fault Tolerance
What if things go wrong?
Cascading replication policy is built for high resiliency, with a fault-tolerant design
State about all peers is maintained on the search head
Bundle state is communicated through a periodic heartbeat from the search head to the indexers
The search head is responsible for attempting retries
New bundle replication activity is blocked while a replication cycle is in progress
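The bookkeeping described on this slide can be sketched roughly as follows. Everything here is hypothetical and for illustration only: the class name `ReplicationCoordinator`, the state strings, and the `max_retries` parameter are not taken from Splunk, and the heartbeat mechanism is not modelled. The sketch shows just three ideas: per-peer state kept on the search head, retries driven by the search head, and a new cycle being refused while one is active.

```python
class ReplicationCoordinator:
    """Hypothetical search-head-side bookkeeping; not Splunk's implementation."""

    def __init__(self, peers, max_retries=2):
        self.peer_state = {peer: "pending" for peer in peers}
        self.max_retries = max_retries
        self.cycle_active = False

    def run_cycle(self, send):
        """Run one cycle; `send(peer)` returns True on success.

        Returns False without doing anything if a cycle is already active.
        """
        if self.cycle_active:
            return False
        self.cycle_active = True
        try:
            for peer in self.peer_state:
                for _attempt in range(1 + self.max_retries):
                    if send(peer):
                        self.peer_state[peer] = "ok"
                        break
                else:
                    # Every attempt failed for this peer
                    self.peer_state[peer] = "failed"
        finally:
            self.cycle_active = False
        return True


coordinator = ReplicationCoordinator(["A", "B", "C"])
attempts = {}
nested_results = []


def flaky_send(peer):
    attempts[peer] = attempts.get(peer, 0) + 1
    # Try to start a second cycle while the first one is still running
    nested_results.append(coordinator.run_cycle(lambda p: True))
    # Peer "B" fails on its first attempt only
    return not (peer == "B" and attempts[peer] == 1)


started = coordinator.run_cycle(flaky_send)
print(started, coordinator.peer_state, nested_results)
```

In the usage above, peer "B" succeeds on the retry, and every attempt to start a nested cycle is refused because the first cycle is still in progress.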
Fault Tolerance
What all could possibly go wrong? Like literally…
Topology Changes
• Peer addition
• Peer deletion
• Peer restarted
Bundle Failures
Bundle Stuck
Duplicate/Late Delivery
[Diagram: a Search Head (SH) and six Search Peers (SP)]
Performance Analysis
Let's come to the numbers…
Average Reduction in use of WAN bandwidth: 61%
Average Improvement in Replication Time: 67%
Average Reduction in CPU time on Search Head: 75%
Performance Analysis
Replication Time with Bundle Indexing
Bundle Size = 50MB
• 120 indexers – 79% faster
• 1000 indexers – 97% faster
Bundle Size = 1GB
• 120 indexers – 36% faster
• 1000 indexers – 85% faster
Performance Analysis
Replication Time Without Bundle Indexing
If bundle indexing (lookups) is excluded, cascading bundle replication performs even better
For example, with Bundle Size = 1GB
• replication time with indexing in cascading mode – 37% faster
• replication time without indexing in cascading mode – 70% faster
System Metrics
Resource Usage Analysis
Search Head Resource Usage
SH Metrics | Classic | Cascading | Comparison
CPU | 300%~500% | 100% | Cascading policy consumes less CPU
Memory | 300MB | 300MB | Similar
IO Read | 0 | 0 | Similar
IO Write | 200~2000 | 200~2000 | Similar
Network Recv | 0 | 0 | Similar
Network Sent | 40MBps~50MBps | 10MBps | Cascading policy consumes less network bandwidth
Search Peer Resource Usage
SP Metrics | Classic | Cascading | Comparison
CPU | 100% | 100%~400% | Cascading policy consumes more CPU
Memory | 300MB~2000MB | 300MB~2000MB | Similar
IO Read | 0 | 0 | Similar
IO Write | 500~2500 | 500~3600 | Similar
Network Recv | 10MBps | 10MBps | Similar
Network Sent | 0 | 10MBps~30MBps | Cascading policy consumes more network bandwidth
Configuration
Deploying in production
distsearch.conf
• Applicable on: Search Head
• Requires restart: Yes
• Preferred mode of deployment in Search Head Cluster: Deployer
[replicationSettings]
replicationPolicy = cascading
server.conf
• Applicable on: Search Peer
• Requires restart: Yes
• Preferred mode of deployment in Indexer Cluster: Cluster Master Bundle Push
[cascading_replication]
pass4SymmKey = <secret>
REST & CLI
Feature Visibility
Bundle Replication Configuration
• REST: /services/search/distributed/bundle/replication/config
• CLI: ./splunk show bundle-replication-config
Bundle Replication Status
• REST: /services/search/distributed/bundle/replication/cycles
• CLI: ./splunk show bundle-replication-status
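The two REST endpoints above could be queried from a script along these lines. This is a minimal sketch, assuming the search head's management interface is reachable on Splunk's default management port 8089 and accepts HTTP basic authentication. The host name, the helper names `endpoint_url` and `fetch`, and the short keys "config" and "cycles" are illustrative. The shape of the JSON response is not described in the slides, so the sketch returns it without interpreting it.

```python
import base64
import json
import urllib.request

MGMT_PORT = 8089  # Splunk's default management port; adjust for your deployment

# Endpoint paths as given on the slide; the short keys are illustrative
ENDPOINTS = {
    "config": "/services/search/distributed/bundle/replication/config",
    "cycles": "/services/search/distributed/bundle/replication/cycles",
}


def endpoint_url(host, name, port=MGMT_PORT):
    """Build the URL for one of the bundle replication endpoints."""
    return f"https://{host}:{port}{ENDPOINTS[name]}?output_mode=json"


def fetch(host, name, user, password, context=None):
    """GET the endpoint and return the decoded JSON body, uninterpreted.

    `context` is an optional ssl.SSLContext, for example one that trusts
    the certificate authority used by your deployment.
    """
    request = urllib.request.Request(endpoint_url(host, name))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


# Hypothetical host name, shown only to illustrate the resulting URL
print(endpoint_url("sh1.example.com", "cycles"))
```

Calling `fetch("sh1.example.com", "cycles", user, password)` against a real search head would return whatever the status endpoint reports for recent replication cycles.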
Monitoring
How do I observe the feature?
New Dashboards in Splunk Distributed Monitoring Console (DMC)
• Configure in Distributed Mode
• Click on Search -> Knowledge Bundle Replication
New Metrics in metrics.log
• bundle_metadata
• cycle_dispatch
• peer_dispatch
Telemetry collected for better diagnosis and supportability
Key Takeaways
3 things you should take home with you…
1. Improved search experience with highly performant (avg 67% faster) and fault-tolerant knowledge bundle replication in cascading mode
2. Simple turn-key configuration with automatic site and deployment awareness
3. Finer-grained monitoring of, and visibility into, knowledge bundle replication than ever before