Splunk Live

Apr 06, 2018

apaperino
Transcript
  • 8/3/2019 Splunk Live

    1/20

    Splunk Live October 2010

    [email protected]

    Who we are

    Established financial services technology consulting company

    Founded in 2004 by experts in risk management technology

    Exclusive focus on Capital Markets

    Engaged at top-tier international banks and hedge funds

    Offices in NY, London, Bangalore

    www.riskfocusinc.com

    Broad product, functional and technology expertise

    Expertise translates into common solution patterns which are reused to client benefit

    Products: Credit, Rates, Commodities, FX

    Process: Trade Capture, Valuation, on demand / end of day Valuation / Risk, Enterprise Market / Credit Risk

    FpML: A Common Language for Financial Communication

    Our Approach

    We aim for better, generalized solutions to problem patterns

    Presentation Agenda

    The Enterprise IT Problem

    Challenges of Enterprise Systems

    Splunk Solutions for the whole Software Development Lifecycle:

    Cross-cutting concerns

    Design

    Release Cycle

    Operations

    Recommendations

    Towers of Hanoi or Tower of Babel?

    The algorithm

    [Diagram: Common Language, Effective Communication, Strategic Success, Clear Message, Reactive, Splunk]

    The architecture

    [Diagram: Common Format, Transparent Conversations, Robust System, Message Driven, Reactive, Splunk]

    Unified Operational Intelligence with Splunk

    Capital Markets systems:

    Expensive

    Complex

    Large operational and support teams

    Maintenance/support lags development initiatives

    Costly downtime

    Maintenance: Preventive is better than Corrective

    Corrective Maintenance: quick and replicable

    EXAMPLE: Fictional Trading System Diagram

    Operational Patterns in Large Systems

    How do we apply behavior across functional components?

    Cross-cutting concerns: apply to all parts, regardless of function

    At application level, often handled via Aspect Oriented Programming:

    Logging

    Performance Profiling

    Security

    Transactionality
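
At the application level, the idea of applying one behavior across unrelated components can be sketched with a Python decorator, which stands in here for an aspect. This is only an illustration: the functions `handle_novation` and `save_trade` are hypothetical stand-ins named after the components on the slides, and the log format is an assumption.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")


def logged(func):
    """Wrap a function so every call is logged and timed without touching its body."""
    log = logging.getLogger(func.__module__)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the failure with its traceback, then let it propagate.
            log.exception("call=%s status=error", func.__qualname__)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        log.info("call=%s status=ok elapsed_ms=%.2f", func.__qualname__, elapsed_ms)
        return result

    return wrapper


# Hypothetical components: the bodies are invented for illustration.
@logged
def handle_novation(trade_id):
    return f"novated:{trade_id}"


@logged
def save_trade(trade_id):
    if not trade_id:
        raise ValueError("trade_id is required")
    return True
```

The same decorator covers logging and basic performance profiling for any function it wraps, which is the sense in which the concern cuts across functional components.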

    But what about at higher levels?

    This is how the operations team experiences the system

    Cross-cutting at the APPLICATION Level

    [Diagram: components MessageListener, NovationHandler, TradeDAO; cross-cutting concern: Logging]

    Cross-cutting at the SYSTEM Level

    [Diagram: components ExternalGateway, Client TradeProcessing; cross-cutting concern: Log Aggregation]

    Cross-cutting at the ORGANIZATION Level

    [Diagram: components Valuation System, Trading System, Market Data System; cross-cutting concern: Operational Intelligence]

    Design

    Problem: The Design Paradox

    Modular and distributed architectures are great for design and development:

    increased productivity

    improved flexibility

    But they make a system look fragmented to the operational teams. Borders are problematic.

    Example: An issue occurs within one of the components

    This leads to an incident across the border

    The symptoms are observed in a different place at a different time

    Solution: Aggregate all logs and cross-index them

    Create an integrated dashboard
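
As a rough sketch of what aggregating and cross-indexing logs means, the Python below merges lines from three components into one time-ordered stream and indexes them by component and by message id. The log line layout, the component names and the `msg_id` field are all assumptions made for the example. Splunk does this at scale with its own indexer, which is not what is shown here.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log lines: "<ISO timestamp> <component> msg_id=<id> <text>"
GATEWAY_LOG = [
    "2010-10-01T13:00:01 gateway msg_id=M1 received trade",
    "2010-10-01T13:00:05 gateway msg_id=M2 received trade",
]
PROCESSING_LOG = [
    "2010-10-01T13:00:02 processing msg_id=M1 validated",
    "2010-10-01T13:00:07 processing msg_id=M2 ERROR validation failed",
]
VALUATION_LOG = [
    "2010-10-01T13:00:03 valuation msg_id=M1 priced",
]


def parse_line(line):
    """Split one log line into a structured event."""
    timestamp, component, msg_field, text = line.split(" ", 3)
    return {
        "time": datetime.fromisoformat(timestamp),
        "component": component,
        "msg_id": msg_field.split("=", 1)[1],
        "text": text,
    }


def aggregate(sources):
    """Merge events from every source into a single list ordered by time."""
    events = [parse_line(line) for lines in sources for line in lines]
    return sorted(events, key=lambda event: event["time"])


def cross_index(events):
    """Map each component and each message id to the positions of its events."""
    index = {"component": defaultdict(list), "msg_id": defaultdict(list)}
    for position, event in enumerate(events):
        for field in index:
            index[field][event[field]].append(position)
    return index


events = aggregate([GATEWAY_LOG, PROCESSING_LOG, VALUATION_LOG])
index = cross_index(events)
```

With the index in place, an issue seen in one component can be looked up by message id and followed back across the border to where it started.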

    Dashboard

    See issues by:

    functional area

    component

    support classification etc.

    Conversation

    Track a problem message across all components
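
A minimal sketch of tracking one message across components, assuming every component writes a shared message id into its log events. The `Event` type, the field names and the sample data are hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    time: str       # ISO-8601 timestamps of equal width sort correctly as strings
    component: str
    msg_id: str
    text: str


# Hypothetical events, deliberately out of order, as they might arrive from different hosts.
EVENTS = [
    Event("2010-10-01T13:00:07", "processing", "M2", "ERROR validation failed"),
    Event("2010-10-01T13:00:01", "gateway", "M1", "received trade"),
    Event("2010-10-01T13:00:05", "gateway", "M2", "received trade"),
    Event("2010-10-01T13:00:02", "processing", "M1", "validated"),
    Event("2010-10-01T13:00:03", "valuation", "M1", "priced"),
]


def conversation(events, msg_id):
    """Return every event for one message, oldest first, regardless of component."""
    return sorted((e for e in events if e.msg_id == msg_id), key=lambda e: e.time)


def path(events, msg_id):
    """Return the ordered list of components the message passed through."""
    return [e.component for e in conversation(events, msg_id)]
```

Here `path(EVENTS, "M2")` stops at `processing`, which shows where the problem message dropped out of the flow.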

    Release Cycle

    Problem: The Problem Only Occurs in Production (good acronym)

    Tests passed

    For some reason we only see the problem once the system is live

    Example: An exception occurred in QA/UA, but tests passed and no one saw it

    The same problem blew up in Production later

    Solution with Splunk:

    Tag & Categorize events:

    Ignorable

    Known (and have recipe for recovery)

    New

    Link to everything:

    Knowledge Base (e.g. Support Wiki)

    Source Control viewer (FishEye)

    Build Server (TeamCity/Hudson)

    Bug Database (e.g. Jira)
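
The tagging idea can be sketched as a small rule table: each known pattern maps to a category and, where one exists, a link to the recovery recipe. Anything that matches no rule is treated as new. The patterns, messages and wiki URL below are invented for illustration and do not come from any real system. In Splunk itself this would be done with its own tagging features, not with Python.

```python
import re

# Hypothetical rules: (pattern, category, link to the recovery recipe or None).
RULES = [
    (re.compile(r"Heartbeat timeout"), "ignorable", None),
    (
        re.compile(r"Duplicate trade id"),
        "known",
        "https://wiki.example.com/support/duplicate-trade",
    ),
]


def categorize(message):
    """Return the category and knowledge-base link for one log message."""
    for pattern, category, link in RULES:
        if pattern.search(message):
            return {"category": category, "link": link}
    # No rule matched: this is something nobody has classified yet.
    return {"category": "new", "link": None}


def new_events(messages):
    """Keep only the messages that need a human to look at them."""
    return [m for m in messages if categorize(m)["category"] == "new"]
```

Running `new_events` over QA or UA logs after a test cycle would surface an unclassified exception even when every test passed.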

    Root Cause

    Show problem FpML message via REST

    Drill through to Support Wiki for solution

    Operations

    Problem: The Non Sequitur

    Lack of context makes investigation very expensive

    Collaboration frequently means long conference calls

    Example: "We have a problem. Can you look at it?"

    Collaborative effort preceding the call is lost

    Inability to correlate events across components and over time

    Inability to look back historically: when did the problem first appear?

    Did we just introduce it in this release?

    Solution: Just email a Splunk link

    Single entry point for ALL INTELLIGENCE on this problem

    It can be passed around with no loss

    Performance

    Support Email: "Sync was slow starting at 1pm. Any ideas?"

    Useless without Splunk; legitimate with it

    See trends over time, across releases

    Confirm, drill down, resolve
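
To show how a vague report such as "slow starting 1pm" can be confirmed from logged timings, the sketch below buckets sync durations by hour and flags hours well above a baseline. The sample data, the baseline and the factor of two are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (ISO timestamp, sync duration in milliseconds).
SAMPLES = [
    ("2010-10-01T11:10:00", 200),
    ("2010-10-01T11:40:00", 220),
    ("2010-10-01T12:15:00", 210),
    ("2010-10-01T13:05:00", 900),
    ("2010-10-01T13:30:00", 1100),
]


def hourly_average(samples):
    """Average the durations per hour bucket, keyed by 'YYYY-MM-DDTHH'."""
    buckets = defaultdict(list)
    for timestamp, duration_ms in samples:
        buckets[timestamp[:13]].append(duration_ms)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def slow_hours(averages, baseline_ms, factor=2.0):
    """Return the hours whose average exceeds the baseline by the given factor."""
    return [hour for hour, avg in averages.items() if avg > baseline_ms * factor]


averages = hourly_average(SAMPLES)
suspect = slow_hours(averages, baseline_ms=210)
```

The same bucketing applied across releases gives the trend view mentioned above, and the flagged hour is the starting point for the drill-down.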

    Recommendations

    Good Design takes into account the whole lifecycle of a System

    You will be remembered for the failures

    The challenge is Clear Communication. The requirements are Volume, speed, etc.

    You CAN have it both ways: clarity does not have to hinder performance

    Splunk helps

    Design for transparency:

    Optimize for people, not machines. Hardware is cheaper

    Design for the end user

    Design for the operations team

    State should be human readable

    Design for scalability:

    Make it faster by adding hardware, not by compromising transparency

    Make it faster only after it works and is transparent

    A system chain is only as strong as the weakest link

    Splunk unifies it all