Event Driven Automation and Workflows for Auto-remediation
Dmitri Zimine, Senior Director, Automation & Integration
[email protected] #Stack_Storm
© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.
About myself
Past:
• Opalis Software (now aka M$ SC Orchestrator)
• VMware
• OpenStack Mistral core team member
• StackStorm founder & CTO
Present:
• Automation and Integration @ Brocade
Agenda
• Brief History of Event Driven Automation and Workflows
• How it works
• What can be automated
• Workflows - detailed
• Workflow based automation vs alternatives
Automation starts with the workflow
“Workflow is a set of tasks strung together to achieve some meaningful business objective”
Business Process Management
Apply BPM to IT Automation?
The TIBCO Integration Platform
Hype Cycle for Real-Time Infrastructure, 2008
Vendors: BMC, CA, Cisco, VMware, Citrix, Opsware, HP, Microsoft
The problem is bigger than it was 5 years ago
Speed
Tools
More Tools…
Still…
• Manual operations
• Custom scripts
Event Driven Automation 2.0
FBAR (saving 13,680 hours/day)
Naoru
Nurse
Winston (powered by StackStorm)
Azure Automation
Mistral workflow service
StackStorm automation platform
OBSERVE → ORIENT → DECIDE → ACT
Ingredients
• IT domains: config mgmt, storage, networking, containers, cloud infra, monitoring
• Sensors, Actions, Rules, Workflows
• Ops support
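Automation Example
Actors: Monitoring, Automation, Service, Incident Management, Engineer
1. Monitoring sends an event to Automation: “low disk on web301”.
2. Automation resolves known cases, fast. Is it /var/log? Clean up!
3. Unknown problem, need a human: Incident Management is told that Web301 is “low disk”.
4. The engineer gets paged: Wake up, buddy. Something real is going on…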
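The example above boils down to "event in, known case fixed, unknown case escalated". A minimal Python sketch of that decision follows. It is only an illustration, not how StackStorm implements rules: the `Event` fields, `handle`, `cleanup` and `open_incident` are hypothetical names, and the side effects are recorded in a list instead of touching a real host.

```python
from dataclasses import dataclass


@dataclass
class Event:
    host: str   # e.g. "web301"
    kind: str   # e.g. "low_disk"
    path: str   # filesystem that triggered the alert (assumed field)


def handle(event, cleanup, open_incident):
    """Resolve the known case automatically; hand anything else to a human."""
    if event.kind == "low_disk" and event.path.startswith("/var/log"):
        cleanup(event.host, event.path)   # known case: clean up, fast
        return "auto-resolved"
    open_incident(f"{event.host} is {event.kind} on {event.path}")
    return "escalated"


log = []  # records which action ran, in place of real side effects


def cleanup(host, path):
    log.append(("cleanup", host, path))


def open_incident(message):
    log.append(("incident", message))


known = handle(Event("web301", "low_disk", "/var/log"), cleanup, open_incident)
unknown = handle(Event("web301", "low_disk", "/data"), cleanup, open_incident)
print(known, unknown)
```

The point of the split is that the engineer is only paged for the second event.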
What can be automated?
• Security checks
– On malware detection in a VM, isolate the network port on a switch
• App blue-green deployment
– On Jenkins tests passed, bring up a new VM cluster, deploy and configure the app, set the load balancer to send a % of traffic to the new app, monitor, roll forward or back out
• Networking
– On BGP peer down: collect troubleshooting data, post to Slack and create a JIRA ticket
– On link aggregation member error: check load; if the capacity of the rest of the LAG bundle is enough, disable the link with the error
• OpenStack
– Orphan VM clean-up: on orphans detected, shut down, email owner, keep for a few days, delete
– VM evacuation on HW failures: on host RAID failure, get the list of impacted VMs, email VM owners, evacuate VMs, create a JIRA ticket for hardware replacement
• NFV
– Nokia, Ericsson, AT&T, with Mistral and OpenStack
• Service remediation
– Cassandra “node down” recovery: on ring node dying, deploy a new node, configure it, add it to the ring
– Remediating RabbitMQ, Galera cluster, MySQL, and more…
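The link aggregation case above hides a small capacity check: can the rest of the bundle carry the load if the faulty member is disabled? A rough sketch, assuming all members have equal capacity and traffic is rebalanced evenly. The function name and the 80% utilization ceiling are assumptions made for illustration, not something from the deck.

```python
def can_disable_member(member_loads, member_capacity, max_utilization=0.8):
    """Return True if the remaining LAG members can absorb the whole bundle's load.

    member_loads: current load per member link, faulty member included.
    member_capacity: capacity of one member link, same unit as the loads.
    max_utilization: assumed ceiling for the surviving links (hypothetical default).
    """
    total_load = sum(member_loads)
    remaining_links = len(member_loads) - 1
    if remaining_links < 1:
        return False  # never disable the last link in the bundle
    return total_load <= remaining_links * member_capacity * max_utilization


ok = can_disable_member([3.0, 2.0, 4.0, 1.0], member_capacity=10.0)
too_busy = can_disable_member([9.0, 9.0], member_capacity=10.0)
print(ok, too_busy)
```

In a workflow this would be the condition between "check load" and "disable link with error".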
What can be automated?
From: Practice of Cloud System Administration, by Thomas Limoncelli
Benefits
• Avoid failures (fixing on computer time, not human time)
• Reduce incident MTTR (Mean Time To Recover)
• Reduce risk of human error (no fat fingers)
• Positive team impact
– Avoid pager fatigue and team burn-out
– Turn from reactive to proactive, breaking the vicious cycle
– Capture operational knowledge as code
On-call, Without Automation
Timeline, 2:00 AM to 2:30 AM (ticks at 2:00, 2:02, 2:07, 2:10, 2:15, 2:20, 2:30): PagerDuty alert → engineer wakes up → logs in and ACKs → studies the alert → checks runbook → runs diagnostics → fixes the problem
On-call With Winston
Timeline, 2:00 AM to 2:15 AM (ticks at 2:00, 2:05, 2:05, 2:15): alert goes to Winston → false positive, or assisted diagnostics → fixed the problem
Benefits
Uses event driven automation and workflows with Brocade Workflow Composer to run Virtual Desktop Service
• Virus detection: 80% reduction in ops man-hours to detect, isolate and resolve
• Adding tenant: 70% reduction in man-hours
• Environment verification: 50% reduction in time to verify, 120% verification coverage
• Threshold monitoring: 40% decrease in incidents caused by lack of resources
• Troubleshooting: 40% reduction in data collection time
• Network troubleshooting (congestion, loops): 80% reduction in man-hours, minimizing operational mistakes
“Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: auto-remediating a 2,000-node OpenStack cluster at Symantec with StackStorm
Benefits
• Reduce MTTR (Mean Time to Resolution)
• Avoid failures (fixing on computer time, not human time)
• Reduce risk of human error (no fat fingers)
• Positive team impact
– Avoid pager fatigue and team burn-out
– Turn from reactive to proactive (break reactive vicious cycle)
– Capture operational knowledge as code
Into Details:
• Workflows
• Why not scripts?
• Why not legacy workflow management systems?
• Taming your automation
Workflows
Workflows
• IT domains: config mgmt, storage, networking, containers, cloud infra, monitoring
• Sensors, Actions, Rules, Workflows (MISTRAL)
• Ops support
N.B.: Event Driven Automation > Workflow, but Workflow is a key element.
Key Workflow Patterns
• Theory: ~100 patterns - http://www.workflowpatterns.com/
• Practice: IMAO, only a few are sufficient for IT & DC automation
Basic: Sequence
...
tasks:
  t1_update_config:
    action: core.remote_sudo
    input:
      cmd: sed -i -e"s/keepalive_timeout
      hosts: my_webserver.example.com
    on-complete: t2_cleanup_logs
  t2_cleanup_logs:
    action: core.remote_sudo
    input:
      cmd: rm /var/log/nginx/
      hosts: my_webserver.example.com
    on-complete: t3_restart_service
  t3_restart_service:
    action: core.remote_sudo cmd="servic
Flow: t1 → t2 → t3
Basic: Data Passing
examples.data_pass:
  input:
    - host
  tasks:
    t1_diagnose:
      action: diag.run_mysql_diag
      input:
        host: <% $.host %>
      publish:
        - msg: <% $.t1_diagnose.stdout.summary %>
      on-complete: t2_post_to_chat
    t2_post_to_chat:
      action: chatops.say
      input:
        header: Returned <% $.t1_diagnose.code %>
        details: <% $.msg %>
Flow: t1 → t2, passing t1.code=0 and msg=“Some string..”
Basic: Conditions
tasks:
  ...
  t1_deploy:
    action: ops.deploy_fleet
    on-success: t2_post_to_chat
    on-error: t3_page_admin
  t2_post_to_chat:
    action: chatops.say
    input:
      header: Successfully deployed <% $.t1_diag
  t3_page_admin:
    action: pagerduty.launch_incident
    input:
      details: Have to wake up dude... <% $.msg %>
Flow: t1 → t2 on success, t1 → t3 on error
Basic: Conditions on Data
  t1_diagnose:
    action: ops.run_switch_diag
    publish:
      - code: <% $.t1_diagnose.return_code %>
    on-complete:
      - t2_post_to_chat: <% $.code == 0 %>
      - t3_page_network_admin: <% $.code > 0 %>
  t2_post_to_chat:
    action: slack.post
    input:
      header: "Switch <% $.switch %> checked, OK"
  t3_page_network_admin:
    action: pagerduty.launch_incident
    input:
      details: Have to wake up dude... <% $.t1_diagnose.stdout %>
Flow: t1 → t2 when t1.code == 0, t1 → t3 when t1.code > 0
That’s the basics!
Sufficient. But there is more…
More: Parallel Execution
...
  t1_do_build:
    action: cicd.do_build_and_packages
    on-success:
      - t2_test_ubuntu14
      - t3_test_fedora20
      - t4_test_rhel6
  t2_test_ubuntu14:
    action: cicd.deploy_and_test distro="UBUNTU14"
  t3_test_fedora20:
    action: cicd.deploy_and_test distro="F20"
  t4_test_rhel6:
    action: cicd.deploy_and_test distro="RHEL6"
Flow: t1 → t2, t3, t4 in parallel
More: Join
16 ways to join
Flow: t1 → t2, t3, t4 in parallel → t5
More: Join—Simple Merge
http://www.workflowpatterns.com/patterns/control/basic/wcp5.php
Simple Merge
...
  t2_test_ubuntu14:
    action: cicd.deploy_and_test distro="UBUNTU14"
    on-success: t5_post_status
  t3_test_fedora20:
    action: cicd.deploy_and_test distro="F20"
    on-success: t5_post_status
  t4_test_rhel6:
    action: cicd.deploy_and_test distro="RHEL6"
    on-success: t5_post_status
  t5_post_status:
    action: chatops.say
    input:
      header: Test completed!
Flow: t2, t3 and t4 each trigger t5, so t5 runs three times
More: Join—AND Join
http://www.workflowpatterns.com/patterns/control/new/wcp33.php
Full “AND” Join
...
  t2_test_ubuntu14:
    action: cicd.deploy_and_test distro="UBUNTU14"
    on-success: t5_tag_release
  t3_test_fedora20:
    action: cicd.deploy_and_test distro="F20"
    on-success: t5_tag_release
  t4_test_rhel6:
    action: cicd.deploy_and_test distro="RHEL6"
    on-success: t5_tag_release
  t5_tag_release:
    join: all
    action: cicd.tag_release
Flow: t5 runs once, after t2, t3 and t4 have all succeeded
More: Join—Discriminator
http://www.workflowpatterns.com/patterns/control/advanced_branching/wcp9.php
Discriminator
...
  t2_test_ubuntu14:
    action: cicd.deploy_and_test distro="UBUNTU14"
    on-error: t5_report_and_fail
  t3_test_fedora20:
    action: cicd.deploy_and_test distro="F20"
    on-error: t5_report_and_fail
  t4_test_rhel6:
    action: cicd.deploy_and_test distro="RHEL6"
    on-error: t5_report_and_fail
  t5_report_and_fail:
    join: one
    action: chatops.say header="FAILURE!"
    on-complete: fail
Flow: the first of t2, t3, t4 to fail triggers t5, once
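The three join variants above differ only in how many times t5 starts. A small Python model of that, as a sketch of the semantics described on these slides and not of the Mistral engine itself. `downstream_runs` and its arguments are hypothetical names: each boolean says whether one upstream branch fired its transition to t5.

```python
def downstream_runs(branch_fired, join=None):
    """Count how many times the downstream task starts.

    branch_fired: one boolean per upstream branch, True if that branch
                  triggered its transition to the downstream task.
    join: None (simple merge), "all" (AND join) or "one" (discriminator).
    """
    arrivals = sum(1 for fired in branch_fired if fired)
    if join is None:
        return arrivals                      # one run per arriving branch
    if join == "all":
        return 1 if arrivals == len(branch_fired) else 0
    if join == "one":
        return 1 if arrivals >= 1 else 0     # first arrival wins, rest ignored
    raise ValueError(f"unknown join type: {join!r}")


all_ok = [True, True, True]
one_missing = [True, False, True]

print(downstream_runs(all_ok))                    # simple merge
print(downstream_runs(all_ok, join="all"))        # AND join
print(downstream_runs(one_missing, join="all"))   # AND join never satisfied
print(downstream_runs(one_missing, join="one"))   # discriminator
```

This is why "Test completed!" is posted three times without a join, while the release is tagged once with `join: all`.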
More: Multiple Data
...
  t1_get_ip_list:
    action: myinventory.allocate_ips num=4
    publish:
      - ip_list: <% $.t1_get_ip_list.ips %>
    on-complete: t2_create_vms
  t2_create_vms:
    with-items: ip in <% $.ip_list %>
    action: myaws.create_vms ip=<% $.ip %>
Flow: t1 → t2, with t2 repeated for each item of ip_list=[...]
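What `with-items` does can be pictured as "run the same action once per list item and collect the results". A minimal Python sketch using the standard library thread pool. `create_vm` is a hypothetical stand-in for the `myaws.create_vms` action, the addresses are documentation-range examples, and the concurrency value is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def create_vm(ip):
    """Hypothetical stand-in for the action; returns a fake result."""
    return {"ip": ip, "status": "created"}


def with_items(action, items, concurrency=2):
    """Run the action once per item; results come back in item order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(action, items))


ip_list = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
results = with_items(create_vm, ip_list)
print(results)
```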
Recap: Key Workflow Operations
• Sequence
• Data passing
• Conditions (on data)
• Parallel execution
• Joins
• Multiple Data Items
Why not Scripts?
• Simple to define, reason about, visualize
• Transparent
– state is clear, execution is trackable: running, complete, failed steps
• Reliable
– Workflows are long-running
– Crash tolerance
– “Restart from point of failure”
Workflows Better in Operations
• Simple to define, reason about, visualize
• Transparent
– state is clear, execution is trackable: running, complete, failed steps
• Reliable
– Workflows are long-running
– Crash tolerance
– “Restart from point of failure”
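“Restart from point of failure” is the property a plain script usually lacks. A toy Python sketch of the idea: per-step state is recorded, and a re-run skips whatever already completed. The `state` dict stands in for a persistent store, and all names here are hypothetical; a real workflow engine does considerably more.

```python
def run_workflow(steps, state):
    """Run steps in order, skipping those already marked complete in state.

    steps: list of (name, callable) pairs.
    state: dict of step name -> "running" | "complete" | "failed";
           stands in for a persistent store that survives a crash.
    Returns True if every step is complete, False on the first failure.
    """
    for name, func in steps:
        if state.get(name) == "complete":
            continue                      # done in a previous run
        state[name] = "running"
        try:
            func()
        except Exception:
            state[name] = "failed"        # visible, trackable failure
            return False
        state[name] = "complete"
    return True


calls = []
attempts = {"restart_service": 0}


def update_config():
    calls.append("update_config")


def restart_service():
    attempts["restart_service"] += 1
    if attempts["restart_service"] == 1:
        raise RuntimeError("service did not come up")  # fails on first try only
    calls.append("restart_service")


steps = [("update_config", update_config), ("restart_service", restart_service)]
state = {}
first = run_workflow(steps, state)        # stops at restart_service
state_after_first = dict(state)
second = run_workflow(steps, state)       # resumes; update_config is not re-run
print(first, state_after_first, second, calls)
```

The same state record is what makes execution transparent: at any moment you can see which steps are running, complete or failed.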
Why not Legacy RunBook Automation?
DevOps: Infrastructure as Code
Infrastructure as code
Case Study
• Automated provisioning, 4 data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: BWC, dev -> code review -> staging -> QA -> prod
Top predictor of IT performance?
Version control used by Ops for Ops artifacts!
Designed for DevOps
1. Support infrastructure as code
2. Open Source
3. Scale and reliability
4. Part of tool chain
5. Social coding & collaboration
6. More demanding - requires skills
Part of tool chain
Devops Tools vs Enterprise Suites
OR
Leverage social coding
Community packs @ StackStorm exchange
More demanding
OR
Requires skills – CLI, scripting, understanding
Operation Patterns
Capture and share operational patterns as code!
Summary
• Event-driven automation works: benefits to reliable cloud operations
• Automation must be reliable and transparent: workflows beat scripts
• Infra as code is key: repeatable, testable, reliable automation
StackStorm
Open Source, Apache 2.0
• GitHub: github.com/StackStorm/st2
• Twitter: Stack_Storm
• IRC: #stackstorm on FreeNode
• stackstorm.slack.com on Slack
• www.stackstorm.com
Brocade Workflow Composer
Commercial Edition
• Enterprise features
• Priority support
• brocade.com/bwc
• docs: bwc-docs.brocade.com
• Network lifecycle automation suite
Questions & Answers