Dmitri Zimine, Senior Director, Automation & Integration. #Stack_Storm. Event Driven Automation and Workflows for Auto-remediation. © 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.
Event driven-automation and workflows

Jan 07, 2017

Dmitri Zimine
Transcript
Page 1: Event driven-automation and workflows

© 2016 BROCADE COMMUNICATIONS SYSTEMS, INC.

Dmitri Zimine, Senior Director, Automation & Integration

#Stack_Storm

Event Driven Automation and Workflows for Auto-remediation

Page 2: Event driven-automation and workflows

About myself

Past:
• Opalis Software (now aka M$ SC Orchestrator)
• VMware
• OpenStack Mistral core team member
• StackStorm founder & CTO

Present:
• Automation and Integration @ Brocade

Page 3: Event driven-automation and workflows

Agenda

• Brief History of Event Driven Automation and Workflows
• How it works
• What can be automated
• Workflows - detailed
• Workflow based automation vs alternatives

Page 4: Event driven-automation and workflows

Automation starts with the workflow

“Workflow is a set of tasks strung together to achieve some meaningful business objective.”

Page 5: Event driven-automation and workflows

Page 6: Event driven-automation and workflows

Business Process Management

Page 7: Event driven-automation and workflows

Apply BPM to IT Automation?

The TIBCO Integration Platform

Page 8: Event driven-automation and workflows

Page 9: Event driven-automation and workflows

Hype Cycle for Real-Time Infrastructure, 2008

Page 10: Event driven-automation and workflows

Vendors shown: BMC, CA, Cisco, VMware, Citrix, Opsware, HP, Microsoft

Page 11: Event driven-automation and workflows

Page 12: Event driven-automation and workflows

Page 13: Event driven-automation and workflows

The problem is bigger than it was 5 years ago

Page 14: Event driven-automation and workflows

Speed

Page 15: Event driven-automation and workflows

Tools

Page 16: Event driven-automation and workflows

More Tools…

Page 17: Event driven-automation and workflows

Still…

• Manual operations
• Custom scripts

Page 18: Event driven-automation and workflows

Event Driven Automation 2.0

• FBAR (saving 13,680 hours/day)
• Naoru
• Nurse
• Winston (powered by StackStorm)
• Azure Automation
• Mistral workflow service
• StackStorm automation platform

OBSERVE, ORIENT, DECIDE, ACT

Page 19: Event driven-automation and workflows

Ingredients

IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring

Sensors, Rules, Workflows, Actions

Ops Support
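
One way to picture how these ingredients fit together is the minimal Python sketch below. It is not StackStorm's actual API. It assumes that a sensor emits an event, that rules match events against simple criteria, and that a matching rule invokes an action or workflow. The names `Rule`, `dispatch`, `cleanup_logs` and the event fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]


@dataclass
class Rule:
    """Maps an event type plus simple criteria to an action (hypothetical shape)."""
    event_type: str
    criteria: Dict[str, str]
    action: Callable[[Event], str]

    def matches(self, event: Event) -> bool:
        if event.get("type") != self.event_type:
            return False
        return all(event.get(k) == v for k, v in self.criteria.items())


def dispatch(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule that matches the event; return their results."""
    return [rule.action(event) for rule in rules if rule.matches(event)]


def cleanup_logs(event: Event) -> str:
    # Stand-in for a real action or workflow.
    return f"cleaned logs on {event['host']}"


rules = [Rule("low_disk", {"mount": "/var/log"}, cleanup_logs)]
# A sensor would emit an event like this after watching a monitoring system.
event = {"type": "low_disk", "host": "web301", "mount": "/var/log"}
results = dispatch(event, rules)
```

Events that match no rule simply produce no actions, which keeps the "what to watch" and "what to do" parts independent of each other.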

Page 20: Event driven-automation and workflows

Automation Example

Participants: Service Monitoring, Automation, Incident Management, Engineer

• Service Monitoring: Event: “low disk on web301”
• Automation: Web301 is “low disk”
• Automation: Resolve known cases, fast. Is it /var/log? Clean up!
• Incident Management: Unknown problem, need a human
• Engineer: Wake up, buddy. Something real is going on…
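
The decision in this example (fix the known case automatically, escalate everything else) could look like the following Python sketch. The function name, the `known_cleanable` list, and the 0.5 threshold are assumptions made for illustration, not values from the talk.

```python
def handle_low_disk(host, usage_by_dir, known_cleanable=("/var/log",), threshold=0.5):
    """Return (decision, detail): auto-remediate known cases, otherwise escalate.

    usage_by_dir maps a directory to the fraction of the disk it occupies.
    """
    # Find the directory that takes up the most space.
    culprit = max(usage_by_dir, key=usage_by_dir.get)
    if culprit in known_cleanable and usage_by_dir[culprit] >= threshold:
        return ("remediate", f"clean up {culprit} on {host}")
    # Unknown problem: a human has to look at it.
    return ("escalate", f"page on-call engineer about {host}")
```

Only the unknown cases reach the engineer; the known `/var/log` case is resolved without paging anyone.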

Page 21: Event driven-automation and workflows

What can be automated?

• Security checks
  – On malware detection in a VM, isolate network port on a switch
• App blue-green deployment
  – On Jenkins tests passed, bring up a new VM cluster, deploy and configure the app, set the load balancer to send a % of traffic to the new app, monitor, then roll forward or back out
• Networking
  – On BGP peer down: collect troubleshooting data, post on Slack, and create a JIRA ticket
  – On link aggregation member error: check load; if the capacity of the rest of the LAG bundle is enough, disable the link with the error
• OpenStack
  – Orphan VM clean-up: on orphans detected, shut down, email owner, keep for a few days, delete
  – VM evacuation on HW failures: on host RAID failure, get the list of impacted VMs, email VM owners, evacuate VMs, create a JIRA ticket for hardware replacement
• NFV
  – Nokia, Ericsson, AT&T, with Mistral and OpenStack
• Service remediation
  – Cassandra “node down” recovery: on ring node dying, deploy new node, configure, add to the ring
  – Remediating RabbitMQ, Galera cluster, MySQL, and more…

Page 22: Event driven-automation and workflows

What can be automated?

From: Practice of Cloud System Administration, by Thomas Limoncelli

Page 23: Event driven-automation and workflows
Page 24: Event driven-automation and workflows

Benefits

• Avoid failures (fixing on computer time, not human time)
• Reduce incident MTTR (Mean Time To Recover)
• Reduce risk of human error (no fat fingers)
• Positive team impact
  – Avoid pager fatigue and team burn-out
  – Turn from reactive to proactive (break the reactive vicious cycle)
  – Capture operational knowledge – as code

Page 25: Event driven-automation and workflows

On-call, Without Automation

Timeline, 2:00 AM to 2:30 AM (marks at 2:00, 2:02, 2:07, 2:10, 2:15, 2:20, 2:30):
• PagerDuty alert
• Engineer wakes up
• Logs in and ACKs
• Studies the alert
• Checks runbook
• Runs diagnostics
• Fixes the problem

Page 26: Event driven-automation and workflows

On-call With Winston

Timeline, 2:00 AM to 2:15 AM (marks at 2:00, 2:05, 2:05, 2:15):
• Winston
• False positive
• Assisted diagnostics
• Fixed the problem

Page 27: Event driven-automation and workflows

Benefits

Uses event driven automation and workflows with Brocade Workflow Composer to run Virtual Desktop Service

• Virus Detection: 80% reduction in ops man-hours to detect, isolate and resolve
• Adding tenant: 70% reduction in man-hours
• Environment Verification: 50% reduction in time to verify; 120% verification coverage
• Threshold Monitoring: 40% decrease in incidents caused by lack of resources
• Troubleshooting: 40% reduction in data collection time
• Network Troubleshooting (congestion, loops): 80% reduction in man-hours, minimizing operational mistakes

Page 28: Event driven-automation and workflows

“Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona

Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm

Page 29: Event driven-automation and workflows

Benefits

• Reduce MTTR (Mean Time to Resolution)
• Avoid failures (fixing on computer time, not human time)
• Reduce risk of human error (no fat fingers)
• Positive team impact
  – Avoid pager fatigue and team burn-out
  – Turn from reactive to proactive (break the reactive vicious cycle)
  – Capture operational knowledge – as code

Page 30: Event driven-automation and workflows

Into Details:

• Workflows
• Why not scripts?
• Why not legacy workflow management systems?
• Taming your automation

Page 31: Event driven-automation and workflows

Workflows

Page 32: Event driven-automation and workflows

Workflows

IT Domains: Config mgmt, Storage, Networking, Containers, Cloud Infra, Monitoring

Sensors, Rules, Workflows, Actions

Ops Support

MISTRAL

N.B.: Event Driven Automation > Workflow, but Workflow is a key element.

Page 33: Event driven-automation and workflows

Key Workflow Patterns

• Theory: ~100 patterns - http://www.workflowpatterns.com/

• Practice: IMAO, only a few are sufficient for IT & DC automation

Page 34: Event driven-automation and workflows

Basic: Sequence

    ...
    tasks:
      t1_update_config:
        action: core.remote_sudo
        input:
          cmd: sed -i -e"s/keepalive_timeout
          hosts: my_webserver.example.com
        on-complete: t2_cleanup_logs
      t2_cleanup_logs:
        action: core.remote_sudo
        input:
          cmd: rm /var/log/nginx/
          hosts: my_webserver.example.com
        on-complete: t3_restart_service
      t3_restart_service:
        action: core.remote_sudo cmd="servic

Diagram: t1 -> t2 -> t3
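
To make the sequence pattern concrete, here is a minimal Python sketch of an executor that follows `on-complete` links from one task to the next. It only mirrors the shape of the slide and is not how Mistral is implemented. `run_sequence` and the `run_action` callback are hypothetical names.

```python
def run_sequence(tasks, start, run_action):
    """Follow on-complete links from `start`, returning task names in run order."""
    order = []
    current = start
    while current is not None:
        spec = tasks[current]
        run_action(spec["action"], spec.get("input", {}))
        order.append(current)
        # A task without on-complete ends the sequence.
        current = spec.get("on-complete")
    return order


tasks = {
    "t1_update_config": {"action": "core.remote_sudo", "on-complete": "t2_cleanup_logs"},
    "t2_cleanup_logs": {"action": "core.remote_sudo", "on-complete": "t3_restart_service"},
    "t3_restart_service": {"action": "core.remote_sudo"},
}
calls = []  # records which actions were invoked, in place of real remote commands
order = run_sequence(tasks, "t1_update_config", lambda action, params: calls.append(action))
```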

Page 35: Event driven-automation and workflows

Basic: Data Passing

    examples.data_pass:
      input:
        - host
      tasks:
        t1_diagnose:
          action: diag.run_mysql_diag
          input:
            host: <% $.host %>
          publish:
            - msg: <% t1_diagnose.stdout.summary %>
          on-complete: t2_post_to_chat
        t2_post_to_chat:
          action: chatops.say
          input:
            header: Returned <% $.t1_diagnose.code %>
            details: <% $.msg %>

Data passed from t1 to t2: t1.code=0, msg=“Some string..”

Diagram: t1 -> t2
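
The data-passing idea (a task's result is stored in a shared context, and `publish` copies selected values into named variables for later tasks) can be sketched in Python as below. The context layout and the fake diagnostic result are assumptions for illustration, not the engine's real data model.

```python
def run_task(context, name, action, publish):
    """Run `action`, store its result under the task name, then publish named values."""
    result = action(context)
    context[name] = result
    for var, extract in publish.items():
        context[var] = extract(result)
    return context


def fake_mysql_diag(ctx):
    # Stand-in for diag.run_mysql_diag; the result shape is assumed.
    return {"code": 0, "stdout": {"summary": f"{ctx['host']}: replication OK"}}


context = {"host": "db1.example.com"}
run_task(context, "t1_diagnose", fake_mysql_diag,
         publish={"msg": lambda r: r["stdout"]["summary"]})

# What the second task would receive as its inputs.
header = f"Returned {context['t1_diagnose']['code']}"
details = context["msg"]
```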

Page 36: Event driven-automation and workflows

Basic: Conditions

    tasks:
      ...
      t1_deploy:
        action: ops.deploy_fleet
        on-success: t2_post_to_chat
        on-failure: t3_page_admin
      t2_post_to_chat:
        action: chatops.say
        input:
          header: Successfully deployed <% $.t1_diag
      t3_page_admin:
        action: pagerduty.launch_incident
        input:
          details: Have to wake up dude... <% $.msg %>

Diagram: t1 -> t2 (on success), t1 -> t3 (on failure)

Page 37: Event driven-automation and workflows

Basic: Conditions on Data

    t1_diagnose:
      action: ops.run_switch_diag
      publish:
        - code: <% t1_diagnose.return_code %>
      on-complete:
        - t2_post_to_chat: <% $.code == 0 %>
        - t3_page_network_admin: <% $.code > 0 %>
    t2_post_to_chat:
      action: slack.post
      input:
        header: "Switch <% switch %> checked, OK"
    t3_page_network_admin:
      action: pagerduty.launch_incident
      input:
        details: Have to wake up dude... <% $.t1_diagnose.stdout %>

Diagram: t1 -> t2 (t1.code == 0), t1 -> t3 (t1.code > 0)
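
Conditions on data amount to guarded transitions: after a task completes, each candidate next task has an expression that is evaluated against the published data. A minimal Python sketch, with plain lambdas standing in for the `<% ... %>` expressions (the function name `next_tasks` is hypothetical):

```python
def next_tasks(transitions, context):
    """Return the names of tasks whose guard evaluates true for the context."""
    return [name for name, guard in transitions if guard(context)]


# Mirrors the on-complete block of t1_diagnose on the slide.
on_complete = [
    ("t2_post_to_chat", lambda ctx: ctx["code"] == 0),
    ("t3_page_network_admin", lambda ctx: ctx["code"] > 0),
]
```

With `{"code": 0}` only the chat task is selected; any positive code selects the paging task instead.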

Page 38: Event driven-automation and workflows

That’s the basics!

Sufficient. But there is more…


Page 39: Event driven-automation and workflows

More: Parallel Execution

    ...
    t1_do_build:
      action: cicd.do_build_and_packages
      on-success:
        - t2_test_ubuntu14
        - t3_test_fedora20
        - t4_test_rhel6
    t2_test_ubuntu14:
      action: cicd.deploy_and_test distro="UBUNTU14"
    t3_test_fedora20:
      action: cicd.deploy_and_test distro="F20"
    t4_test_rhel6:
      action: cicd.deploy_and_test distro="RHEL6"

Diagram: t1 -> (t2, t3, t4) in parallel
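
As a rough analogy for the fan-out on this slide, the Python sketch below runs one test task per distro concurrently using the standard library's `ThreadPoolExecutor`. `deploy_and_test` is a stand-in that always succeeds; a workflow engine would schedule real actions instead.

```python
from concurrent.futures import ThreadPoolExecutor


def deploy_and_test(distro):
    # Stand-in for cicd.deploy_and_test; always succeeds here.
    return f"{distro}: passed"


def fan_out(distros):
    """Run one test task per distro concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=len(distros)) as pool:
        return list(pool.map(deploy_and_test, distros))


results = fan_out(["UBUNTU14", "F20", "RHEL6"])
```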

Page 40: Event driven-automation and workflows

More: Join

Diagram: t1 -> (t2, t3, t4) -> t5

Page 41: Event driven-automation and workflows

More: Join

16 ways to join

Diagram: t1 -> (t2, t3, t4) -> t5

Page 42: Event driven-automation and workflows

More: Join (Simple Merge)

http://www.workflowpatterns.com/patterns/control/basic/wcp5.php

    ...
    t2_test_ubuntu14:
      action: cicd.deploy_and_test distro="UBUNTU14"
      on-success: t5_post_status
    t3_test_fedora20:
      action: cicd.deploy_and_test distro="F20"
      on-success: t5_post_status
    t4_test_rhel6:
      action: cicd.deploy_and_test distro="RHEL6"
      on-success: t5_post_status
    t5_post_status:
      action: chatops.say
      input:
        header: Test completed!

Diagram: t2, t3, t4 -> t5, t5, t5 (t5 runs once per completed branch)

Page 43: Event driven-automation and workflows

More: Join (AND Join)

http://www.workflowpatterns.com/patterns/control/new/wcp33.php

Full “AND” Join

    ...
    t2_test_ubuntu14:
      action: cicd.deploy_and_test distro="UBUNTU14"
      on-success: t5_tag_release
    t3_test_fedora20:
      action: cicd.deploy_and_test distro="F20"
      on-success: t5_tag_release
    t4_test_rhel6:
      action: cicd.deploy_and_test distro="RHEL6"
      on-success: t5_tag_release
    t5_tag_release:
      join: all
      action: cicd.tag_release

Diagram: (t2, t3, t4) -> t5

Page 44: Event driven-automation and workflows

More: Join (Discriminator)

http://www.workflowpatterns.com/patterns/control/advanced_branching/wcp9.php

    ...
    t2_test_ubuntu14:
      action: cicd.deploy_and_test distro="UBUNTU14"
      on-failure: t5_report_and_fail
    t3_test_fedora20:
      action: cicd.deploy_and_test distro="F20"
      on-failure: t5_report_and_fail
    t4_test_rhel6:
      action: cicd.deploy_and_test distro="RHEL6"
      on-failure: t5_report_and_fail
    t5_report_and_fail:
      join: one
      action: chatops.say header="FAILURE!"
      on-complete: fail

Diagram: (t2, t3, t4) -> t5
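
The difference between `join: all` and `join: one` comes down to when the join task becomes ready to start. The Python sketch below models only that readiness check; it does not model the discriminator's "fire once and ignore later branches" behaviour. The function name `join_ready` is hypothetical.

```python
def join_ready(mode, inbound, completed):
    """Decide whether a join task may start.

    mode "all": every inbound branch has completed (AND join).
    mode "one": at least one inbound branch has completed (discriminator).
    """
    done = [task for task in inbound if task in completed]
    if mode == "all":
        return len(done) == len(inbound)
    if mode == "one":
        return len(done) >= 1
    raise ValueError(f"unknown join mode: {mode}")


inbound = ["t2_test_ubuntu14", "t3_test_fedora20", "t4_test_rhel6"]
```

After only `t2_test_ubuntu14` has finished, `join_ready("one", ...)` is already true while `join_ready("all", ...)` stays false until all three branches are done.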

Page 45: Event driven-automation and workflows

More: Multiple Data

    ...
    t1_get_ip_list:
      action: myinventory.allocate_ips num=4
      publish:
        - ip_list: <% $.t1_get_ip_list.ips %>
      on-complete: t2_create_vms
    t2_create_vms:
      with-items: ip in <% $.ip_list %>
      action: myaws.create_vms ip=<% $.ip %>

Diagram: t1 -> t2 (once per item), ip_list=[...]
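
The `with-items` pattern runs the same action once per element of a published list. A minimal Python sketch of that behaviour, with stand-in functions in place of the `myinventory` and `myaws` actions and addresses taken from the 192.0.2.0/24 documentation range (all assumptions for illustration):

```python
def with_items(items, action):
    """Run `action` once per item and collect the results in order."""
    return [action(item) for item in items]


def allocate_ips(num):
    # Stand-in for myinventory.allocate_ips.
    return [f"192.0.2.{i}" for i in range(1, num + 1)]


def create_vm(ip):
    # Stand-in for myaws.create_vms.
    return {"ip": ip, "state": "created"}


ip_list = allocate_ips(4)
vms = with_items(ip_list, create_vm)
```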

Page 46: Event driven-automation and workflows

Recap: Key Workflow Operations

• Sequence
• Data passing
• Conditions (on data)
• Parallel execution
• Joins
• Multiple Data Items

Page 47: Event driven-automation and workflows

Why not Scripts?

Page 48: Event driven-automation and workflows

Why not Scripts?

• Simple to define, reason, visualize
• Transparent
  – state is clear, execution is trackable: running, complete, failed steps
• Reliable
  – Workflows are long-running
  – Crash tolerance
  – “Restart from point of failure”

Page 49: Event driven-automation and workflows

Page 50: Event driven-automation and workflows

Workflows Better in Operations

• Simple to define, reason, visualize
• Transparent
  – state is clear, execution is trackable: running, complete, failed steps
• Reliable
  – Workflows are long-running
  – Crash tolerance
  – “Restart from point of failure”
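
“Restart from point of failure” relies on the engine recording which steps have finished. The Python sketch below illustrates the idea with an in-memory `state` dictionary standing in for durable storage; a real engine would persist this state so it survives a crash. All names here are hypothetical.

```python
def run_with_checkpoint(steps, state):
    """Run steps in order, skipping those already recorded as done in `state`.

    On failure the failed step is recorded so a later call resumes there.
    """
    for name, fn in steps:
        if name in state["done"]:
            continue  # finished in an earlier run; do not repeat it
        try:
            fn()
        except Exception as exc:
            state["failed"] = name
            state["error"] = str(exc)
            return state
        state["done"].append(name)
    state["failed"] = None
    return state


attempts = {"deploy": 0}
log = []


def build():
    log.append("build")


def deploy():
    # Fails on the first attempt to simulate a transient error.
    attempts["deploy"] += 1
    if attempts["deploy"] == 1:
        raise RuntimeError("transient network error")
    log.append("deploy")


steps = [("build", build), ("deploy", deploy)]
state = {"done": [], "failed": None}
run_with_checkpoint(steps, state)  # first run stops at "deploy"
first_failed = state["failed"]
run_with_checkpoint(steps, state)  # second run resumes at "deploy"; "build" is not repeated
```

A plain script would typically have to be rerun from the top, which is the contrast the slide draws.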

Page 51: Event driven-automation and workflows

Why not Legacy RunBook Automation?

DevOps: Infrastructure as Code

Page 52: Event driven-automation and workflows

Infrastructure as code

Case Study

• Automated provisioning, 4 data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: BWC, dev -> code review -> staging -> QA -> prod

Page 53: Event driven-automation and workflows

Infrastructure as code

Top predictor of IT performance?

Version control used by Ops for Ops artifacts!

Page 54: Event driven-automation and workflows

Designed for DevOps

1. Support infrastructure as code
2. Open Source
3. Scale and reliability
4. Part of tool chain
5. Social coding & collaboration
6. More demanding - requires skills

Page 55: Event driven-automation and workflows

Part of tool chain

DevOps Tools vs Enterprise Suites

Page 56: Event driven-automation and workflows

Leverage social coding

Community packs @ StackStorm exchange

Page 57: Event driven-automation and workflows

More demanding

Requires skills – CLI, scripting, understanding

Page 58: Event driven-automation and workflows

Operation Patterns

Capture and share operational patterns as code!

Page 59: Event driven-automation and workflows

Page 60: Event driven-automation and workflows

Page 61: Event driven-automation and workflows

Page 62: Event driven-automation and workflows

Summary

• Event-driven automation works: benefits to reliable cloud operations
• Automation must be reliable and transparent: workflows beat scripts
• Infra as code is key: repeatable, testable, reliable automation

Page 63: Event driven-automation and workflows

StackStorm

OpenSource, Apache 2.0
• GitHub: github.com/StackStorm/st2
• Twitter: Stack_Storm
• IRC: #stackstorm on FreeNode
• stackstorm.slack.com on Slack
• www.stackstorm.com

Brocade Workflow Composer

Commercial Edition
• Enterprise features
• Priority support
• brocade.com/bwc
• docs: bwc-docs.brocade.com
• Network lifecycle automation suite

Page 64: Event driven-automation and workflows

Questions & Answers