Debugging Distributed Systems, Donny Nadolny (donny@pagerduty.com), SREcon16, 2016-04-08

Page 1: Debugging Distributed Systems

Donny Nadolny (donny@pagerduty.com)
SREcon16, 2016-04-08

Page 3: What is ZooKeeper

• Distributed system for building distributed systems
• Small in-memory filesystem

Page 4: ZooKeeper at PagerDuty

• Distributed locking
• Consistent, highly available

Page 5: The Failure

• Network trouble, one follower falls behind
• ZooKeeper gets stuck - leader still up

Page 8: Fault Injection Finding

• Leader logs: “Too busy to snap, skipping”
• Disk slow? Let’s test:
    sshfs donny@some_server:/home/donny /mnt
• Similar failure profile
• Re-examine disk latency… nope, was a red herring

Page 9: Deep Health Checks

• First warning: application monitoring
• ZooKeeper: used ruok
• Added deep health check
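The difference between the two kinds of check can be sketched in a few lines. The first method sends ZooKeeper's `ruok` four-letter command and expects `imok`, which only shows that the server process answers on its client port. The second is a hypothetical deep check: it runs a caller-supplied real request (for example a write followed by a read) and fails if that request throws or misses a deadline. The class name, method names, and deadline handling are assumptions for illustration; the slides do not say how the deep check at PagerDuty was built.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HealthChecks {
    /** Shallow check: send "ruok" and report whether the server answers "imok". */
    public static boolean ruok(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);
            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] reply = new byte[4];
            int read = 0;
            while (read < reply.length) {
                int n = in.read(reply, read, reply.length - read);
                if (n < 0) {
                    break; // server closed the connection early
                }
                read += n;
            }
            return read == 4 && "imok".equals(new String(reply, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    /** Deep check (hypothetical): run a real request and fail on error or missed deadline. */
    public static boolean deepCheck(Callable<Boolean> realRequest, long deadlineMs) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "deep-health-check");
            t.setDaemon(true); // a hung request must not keep the JVM alive
            return t;
        });
        try {
            Future<Boolean> future = executor.submit(realRequest);
            return Boolean.TRUE.equals(future.get(deadlineMs, TimeUnit.MILLISECONDS));
        } catch (Exception e) {
            return false; // timeout, failure, or interruption all count as unhealthy
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A monitor might call `HealthChecks.ruok("zk1.example.com", 2181, 2000)` (the hostname is a placeholder) alongside `deepCheck` with a request that exercises the write path, and alert when the two disagree.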

Page 10: The Stack Trace

"LearnerHandler-/123.45.67.89:45874" prio=10 tid=0x00000000024bb800 nid=0x3d0d runnable [0x00007fe6c3193000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:113)
        at org.apache.jute.BinaryOutputArchive.writeBuffer(BinaryOutputArchive.java:118)
        …
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1115)
        - locked <0x00000000d4cd9e28> (a org.apache.zookeeper.server.DataNode)
        at org.apache.zookeeper.server.DataTree.serializeNode(DataTree.java:1130)
        at org.apache.zookeeper.server.ZKDatabase.serializeSnapshot(ZKDatabase.java:467)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:493)

Page 11: Write Snapshot Code (simplified)

    void serializeNode(OutputArchive output, String path) {
        DataNode node = getNode(path);
        String[] children = {};
        synchronized (node) {
            output.writeString(path, "path");
            output.writeRecord(node, "node");  // blocking network write while the node is locked
            children = node.getChildren();
        }
        for (String child : children) {
            serializeNode(output, path + "/" + child);
        }
    }

Page 12: ZooKeeper Heartbeat

• Why didn’t a follower take over?
• ZK heartbeat: message from leader to follower, follower times out

Page 13: TCP

Page 14: TCP Data Transmission

[Diagram: Follower and Leader, both ESTABLISHED after the handshake (… SYN, SYN-ACK, ACK …). Packet 1 is sent and answered with an ACK.]

Page 15: TCP Data Transmission

[Diagram: Follower and Leader, both ESTABLISHED. Packet 1 goes unacknowledged and is retransmitted with growing gaps: ~200ms, ~200ms, ~400ms, ~800ms, … ~120sec, ~120sec. After 15 retries the sender's connection is CLOSED.]

Page 16: TCP Retransmission (Linux Defaults)

• Retransmission timeout (RTO) is based on latency
• TCP_RTO_MIN = 200 ms
• TCP_RTO_MAX = 2 minutes
• /proc/sys/net/ipv4/tcp_retries2 = 15 retries
• 0.2 + 0.2 + 0.4 + 0.8 + … + 120 = 924.8 seconds (15.5 mins)
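The sum on this slide can be checked with a short sketch. It assumes the wait starts at TCP_RTO_MIN, doubles after every unacknowledged transmission, and is capped at TCP_RTO_MAX. The class and method names are made up for illustration, and a real kernel derives the starting RTO from measured round-trip time, so the result is best read as a lower bound rather than an exact figure.

```java
public class RtoBackoff {
    /** Waits in milliseconds: start at rtoMinMs, double each time, cap at rtoMaxMs. */
    public static long[] waitsMs(long rtoMinMs, long rtoMaxMs, int count) {
        long[] waits = new long[count];
        long rto = rtoMinMs;
        for (int i = 0; i < count; i++) {
            waits[i] = rto;
            rto = Math.min(rto * 2, rtoMaxMs);
        }
        return waits;
    }

    /** Total time spent waiting across all of the waits above. */
    public static long totalMs(long rtoMinMs, long rtoMaxMs, int count) {
        long total = 0;
        for (long wait : waitsMs(rtoMinMs, rtoMaxMs, count)) {
            total += wait;
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumption: one wait after the first transmission plus one per retry
        // (tcp_retries2 = 15), so 16 waits in total.
        long doubling = totalMs(200, 120_000, 16);
        // The slide's series lists 0.2 s twice before doubling, hence the extra 200 ms.
        long slideSeries = 200 + doubling;
        System.out.println(doubling / 1000.0 + " s");    // 924.6 s
        System.out.println(slideSeries / 1000.0 + " s"); // 924.8 s
    }
}
```

The pure doubling series gives 924.6 s, which is the figure the Linux kernel documentation quotes for `tcp_retries2 = 15`; the slide's series adds one more 0.2 s term. Either way the sender keeps retrying for roughly 15 minutes.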

Page 17: TCP Close Connection

[Diagram: normal close between Follower and Leader, both starting ESTABLISHED. Segments shown: FIN, FIN/ACK, ACK. States shown: FIN_WAIT1, LAST_ACK, TIME_WAIT (60 seconds), and CLOSED on both sides.]

Page 18: TCP Close Connection

[Diagram: Follower and Leader, both ESTABLISHED. The follower enters FIN_WAIT1 and sends FIN repeatedly with no reply (8 retries), then goes to CLOSED after ~1m40s.]

Page 19: TCP Close Connection

[Diagram: Follower and Leader, both ESTABLISHED. The follower goes from FIN_WAIT1 to CLOSED after ~1m40s of unanswered FINs. The leader keeps sending Packet 1 and goes to CLOSED only after ~15.5 mins.]

Page 20: TCP Close Connection

[Diagram: Follower and Leader, both ESTABLISHED. The follower goes from FIN_WAIT1 to CLOSED after ~1m40s of unanswered FINs. The leader's Packet 1 is then answered with RST, and the leader goes to CLOSED.]

Page 21: syslog - Dropped Packets on Follower

• 06:51:47 iptables: WARN: IN=eth0 OUT= MAC=00:0d:12:34:56:78:12:34:56:78:12:34:56:78 SRC=<leader_ip> DST=<follower_ip> LEN=54 TOS=0x00 PREC=0x00 TTL=44 ID=36370 DF PROTO=TCP SPT=3888 DPT=36416 WINDOW=227 RES=0x00 ACK PSH URGP=0

Page 23: iptables

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
... more rules to accept connections …
iptables -A INPUT -j DROP

But: iptables connections != netstat connections
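Why the mismatch matters can be seen by walking a packet through the rules in order. The sketch below is a deliberately simplified model, not how netfilter is implemented: a boolean stands in for "conntrack still has an entry for this connection", a set of ports stands in for the accept rules (protocol matching is ignored), and the first matching rule wins. All names are hypothetical; port 36416 is the destination port from the dropped-packet log line.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class FirewallSketch {
    public enum Verdict { ACCEPT, DROP }

    /**
     * First-match evaluation of the three kinds of rules on the slide.
     * knownToConntrack models "-m state --state ESTABLISHED,RELATED".
     */
    public static Verdict evaluate(boolean knownToConntrack, int destPort, Set<Integer> acceptedPorts) {
        if (knownToConntrack) {
            return Verdict.ACCEPT;           // rule 1: part of a tracked connection
        }
        if (acceptedPorts.contains(destPort)) {
            return Verdict.ACCEPT;           // rule 2 and friends: explicitly opened ports
        }
        return Verdict.DROP;                 // final rule: drop everything else
    }

    public static void main(String[] args) {
        Set<Integer> acceptedPorts = new HashSet<>(Arrays.asList(80));
        // While conntrack remembers the connection, the leader's packet gets through.
        System.out.println(evaluate(true, 36416, acceptedPorts));  // ACCEPT
        // Once conntrack has forgotten it, the same packet falls through to DROP,
        // so the follower's TCP stack never sees it and never replies.
        System.out.println(evaluate(false, 36416, acceptedPorts)); // DROP
    }
}
```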

Page 24: conntrack Timeouts

• From linux/net/netfilter/nf_conntrack_proto_tcp.c:
    [TCP_CONNTRACK_LAST_ACK] = 30 SECS

Page 25: TCP Close Connection

[Diagram: timeline of the follower's close as seen by two timers. Kernel TCP: FIN_WAIT1, FIN retransmitted with growing gaps (~12.8s, ~25.6s, ~51.2s), CLOSED at ~102.4s. conntrack: LAST_ACK with a 30s timeout that restarts on each FIN, CLOSED at ~81.2s.]
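The two end times on this timeline follow from one rule: an idle timer that restarts on every packet expires at the first gap longer than its timeout. The sketch below applies that rule to FIN send times read off the slide (0, then 0.2 s doubling up to 51.2 s). Those times, the 102.4 s kernel figure, and all names are assumptions taken from the diagram labels, not measurements.

```java
public class ConntrackTimer {
    /**
     * Time (ms) at which an idle timer that restarts on every packet first expires.
     * packetTimesMs must be non-empty and sorted in ascending order.
     */
    public static long expiryMs(long[] packetTimesMs, long idleTimeoutMs) {
        for (int i = 1; i < packetTimesMs.length; i++) {
            long deadline = packetTimesMs[i - 1] + idleTimeoutMs;
            if (packetTimesMs[i] > deadline) {
                return deadline; // the entry expired before the next packet arrived
            }
        }
        return packetTimesMs[packetTimesMs.length - 1] + idleTimeoutMs;
    }

    /** FIN send times as labelled on the slide: 0, then 0.2 s doubling up to 51.2 s. */
    public static long[] finTimesMs() {
        long[] times = new long[10];
        times[0] = 0;
        for (int i = 1; i < times.length; i++) {
            times[i] = 200L << (i - 1);
        }
        return times;
    }

    public static void main(String[] args) {
        long conntrackGoneMs = expiryMs(finTimesMs(), 30_000);
        long kernelClosedMs = 102_400; // taken from the slide, not computed here
        System.out.println("conntrack forgets at " + conntrackGoneMs / 1000.0 + " s"); // 81.2 s
        System.out.println("kernel closes at " + kernelClosedMs / 1000.0 + " s");      // 102.4 s
    }
}
```

Under these assumptions conntrack drops its entry about 21 seconds before the kernel gives up on the connection, and from that point the follower's firewall no longer treats the leader's packets as part of an established connection.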

Page 26: The Full Story

• Packet loss
• Follower falls behind, requests snapshot
• (Packet loss continues) follower closes connection
• Follower conntrack forgets connection
• Leader now stuck for ~15 mins, even if network heals

Page 27: Lessons

Page 28: Lesson 1

• Don’t lock and block
• TCP can block for a really long time
• Interfaces / abstract methods make analysis harder
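One common way to avoid locking and blocking at the same time is to copy what is needed while holding the lock and do the slow write after releasing it. The sketch below illustrates that pattern with a hypothetical `Node` class; it is not ZooKeeper's `DataNode`, and it is not presented as the change that was made for ZOOKEEPER-2201.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class CopyThenWrite {
    /** Hypothetical stand-in for a tree node. */
    public static class Node {
        private byte[] data;

        public Node(byte[] data) {
            this.data = data.clone();
        }

        public void setData(byte[] newData) {
            synchronized (this) {
                data = newData.clone();
            }
        }
    }

    /** Copy the node's bytes under its lock, then do the possibly slow write with no lock held. */
    public static void writeNode(Node node, String path, OutputStream out) {
        byte[] copy;
        synchronized (node) {
            copy = node.data.clone(); // short critical section: a memory copy only
        }
        try {
            out.write((path + "=").getBytes(StandardCharsets.UTF_8));
            out.write(copy);          // may block on a socket; setData() callers are not held up
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost is an extra copy per node, which is usually a small price compared with a lock held for as long as a TCP write can stall.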

Page 29: Lesson 2

• Leader/follower heartbeats should be deep health checks!
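One way to read this lesson: a heartbeat sent by a thread that never touches the request path can keep flowing while the real work is stuck. A minimal sketch of the alternative is to gate the heartbeat on recent progress by the working thread. The class below is hypothetical and is not how ZooKeeper implements its heartbeats; timestamps are passed in so the logic is easy to test.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ProgressHeartbeat {
    private final AtomicLong lastProgressMs;
    private final long maxStallMs;

    public ProgressHeartbeat(long nowMs, long maxStallMs) {
        this.lastProgressMs = new AtomicLong(nowMs);
        this.maxStallMs = maxStallMs;
    }

    /** Called by the thread doing real work each time it completes a unit of work. */
    public void recordProgress(long nowMs) {
        lastProgressMs.set(nowMs);
    }

    /** Called by the heartbeat sender: only claim to be alive if real work finished recently. */
    public boolean shouldSendHeartbeat(long nowMs) {
        return nowMs - lastProgressMs.get() <= maxStallMs;
    }
}
```

An idle server would also look stalled under this rule, so a real design would need some periodic no-op that travels the same path as ordinary requests to tell "idle" apart from "stuck".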

Page 30: Questions?

donny@pagerduty.com

Link: “Network issues can cause cluster to hang due to near-deadlock” https://issues.apache.org/jira/browse/ZOOKEEPER-2201