Bugs from Outer Space | while42 SF #6

May 09, 2015

Presentation by Jerome Petazzoni for while42 San Francisco #6 at Kwarter
Transcript
Page 1: Bugs from Outer Space | while42 SF #6

Bugs

From Outer Space

While42 — SF chapter — #6

Page 2: Bugs from Outer Space | while42 SF #6
Page 3: Bugs from Outer Space | while42 SF #6

Why this talk?

Codito, ergo erro

I code, therefore I make mistakes

Page 4: Bugs from Outer Space | while42 SF #6

Outline

I'll show some really nasty bugs,

tell stories of inglorious battles.

(Some of which I've actually fought!)

Featuring: Node.js, EC2, LXC, pseudo-terminals

and also: hardware bugs, dangerous bugs...

Page 5: Bugs from Outer Space | while42 SF #6

Our files, Node.js is truncating them!

It all starts with an angry customer.

“Sometimes, downloading this 700 KB JSON

file will fail, because it’s truncated!”

But… Do you even Content-Length?

(The client library should scream, but it

doesn’t.)
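
A minimal sketch of the check the client library could have made: compare the declared Content-Length with the number of bytes actually received. The function name `body_is_complete` and the sample headers are illustrative assumptions, not part of any real client library.

```python
def body_is_complete(headers: dict, body: bytes) -> bool:
    """Compare the declared Content-Length with the number of bytes received."""
    declared = None
    for name, value in headers.items():
        # Header names are case-insensitive.
        if name.lower() == "content-length":
            declared = int(value.strip())
    if declared is None:
        # No declared length (e.g. chunked encoding): this check cannot decide.
        raise ValueError("no Content-Length header to compare against")
    return len(body) == declared


if __name__ == "__main__":
    headers = {"Content-Type": "application/json", "Content-Length": "16"}
    print(body_is_complete(headers, b'{"status": "ok"}'))  # full body
    print(body_is_complete(headers, b'{"status"'))         # truncated body
```

A client that raised an error on a mismatch would have turned silent truncation into a loud failure.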

Page 6: Bugs from Outer Space | while42 SF #6

Gotta Sniff Some Packets

Log into the load balancer (running Hipache)...

# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80

interface: any

filter: (ip or ip6) and ( tcp port 80 )

match: /api/v1/download-all-the-things

####

T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]

GET /api/v1/download-all-the-things.json HTTP/1.0.

Host: angrystartup.com

X-Forwarded-Port: 443.

X-Forwarded-For: ::ffff:24.13.146.16.

X-Forwarded-Proto: https.

...

Page 7: Bugs from Outer Space | while42 SF #6

Ngrep Doesn’t Cut It.

FETCH THE

WIRESHARKS!

# tcpdump -peni any -s0 -wdump tcp port 80

(Wait a bit)

^C

Transfer dump file

DEMO TIME!

Page 8: Bugs from Outer Space | while42 SF #6
Page 9: Bugs from Outer Space | while42 SF #6

What did we find out?

Truncated files happen because a chunk

(probably exactly one) gets dropped.

Impossible to reproduce locally.

Only the customer sees the problem.

THE PLOT THICKENS.

GET YOUR SWIMSUITS,

WE’RE DIVING INTO CODE!

Page 10: Bugs from Outer Space | while42 SF #6

This is Node.js. I have no idea

what I’m doing.

Add console.log() statements in Hipache.

Add console.log() statements in node-http-proxy.

Add console.log() statements in node/lib/http.js.

The latter didn’t work.

“Fix”: replace require('http') with require('_http')

and add our own _http.js to our node_modules.

Do the same to net.js (in “our” _http.js).

Now analyze an endless stream of obscure events.

Page 11: Bugs from Outer Space | while42 SF #6

It’s all in the pauses

Backend sends lots of data to Hipache.

Hipache sends data to client, but client is slow.

Hipache “pauses” the backend stream.

(i.e. stops reading from the network socket.)

When the client has read enough data,

Hipache “resumes” the stream.

etc.

SO FAR, SO GOOD

Page 12: Bugs from Outer Space | while42 SF #6

It’s all in the awkward

……………………...pauses

There are two layers in Node: tcp and http.

When the tcp layer reads the last chunk,

the socket is closed by the backend.

The tcp layer notices, and sends an “end”

event.

The “end” event causes the “http” layer to finish

what it was doing, without sending a

“resume”.

As a result, some chunks remain in the buffers

of the tcp layer. Lost in space. Forever alone.

Page 13: Bugs from Outer Space | while42 SF #6

How do we fix this?

Pester Node.js folks

Catch that “end” event, and when it happens,

send a “resume” to the stream to drain it.

(Implementation detail: you only have the http

socket, and you need to listen for an event on

the tcp socket, so you need to do slightly dirty

things with the http socket. But eh, it works!)
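
The lost-chunk scenario and the fix can be sketched with a toy model. This is not Node.js code: `ToyTcpSocket`, `relay` and `drain_on_end` are hypothetical names, and the model only mimics the pause/resume/end ordering described in the slides.

```python
class ToyTcpSocket:
    """Toy stand-in for the tcp layer: buffers chunks while paused."""

    def __init__(self, on_data, on_end):
        self.on_data = on_data
        self.on_end = on_end
        self.paused = False
        self.buffer = []

    def receive(self, chunk):
        # While paused, data stays in the tcp layer's buffer.
        if self.paused:
            self.buffer.append(chunk)
        else:
            self.on_data(chunk)

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False
        while self.buffer and not self.paused:
            self.on_data(self.buffer.pop(0))

    def backend_closed(self):
        # The backend closed the connection: emit "end".
        self.on_end()


def relay(chunks, pause_before, drain_on_end):
    """Relay chunks to a slow client; return what the client actually got."""
    delivered = []

    def on_end():
        # The workaround: resume on "end" so buffered chunks are drained.
        if drain_on_end:
            sock.resume()
        # Without it, the http layer just finishes and the buffer is lost.

    sock = ToyTcpSocket(delivered.append, on_end)
    for index, chunk in enumerate(chunks):
        if index == pause_before:
            sock.pause()  # the client is slow, so stop reading from the backend
        sock.receive(chunk)
    sock.backend_closed()
    return delivered


if __name__ == "__main__":
    print(relay(["a", "b", "c"], pause_before=2, drain_on_end=False))
    print(relay(["a", "b", "c"], pause_before=2, drain_on_end=True))
```

With `drain_on_end=False` the last chunk never reaches the client, which matches the truncated downloads; with `drain_on_end=True` everything is delivered.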

Page 14: Bugs from Outer Space | while42 SF #6
Page 15: Bugs from Outer Space | while42 SF #6

What did we learn?

When you can’t reproduce a bug at will, record

it in action (tcpdump) and dissect it

(wireshark).

Spraying code with print statements helps.

(But it’s better to use the logging framework!)

You don’t have to know Node.js to fix Node.js!

Page 16: Bugs from Outer Space | while42 SF #6

Hardware has bugs, too

Pentium FDIV bug (1994):

division errors as early as the 4th decimal place

Pentium F00F bug (1997):

executing one specific invalid instruction hangs the machine

ATA transfer speeds vary when you touch

ribbon cables (SATA introduced in 2003)
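
A small sketch of the classic FDIV check, for illustration. The operands 4195835 and 3145727 are the commonly cited test pair; the flawed quotient below is a widely reported figure, quoted here as a constant rather than computed. The helper names are made up for this example.

```python
# Operands of the classic FDIV check.
NUMERATOR = 4195835.0
DENOMINATOR = 3145727.0

# Quotient widely reported for flawed Pentiums (quoted, not computed here).
REPORTED_FLAWED_QUOTIENT = 1.333739068


def fdiv_residual() -> float:
    """Return x - (x / y) * y, which is 0 on a correct floating-point unit."""
    return NUMERATOR - (NUMERATOR / DENOMINATOR) * DENOMINATOR


def first_differing_decimal(a: float, b: float, places: int = 9) -> int:
    """Return the 1-based position of the first differing decimal digit, or 0.

    Only the digits after the decimal point are compared.
    """
    digits_a = f"{a:.{places}f}".split(".")[1]
    digits_b = f"{b:.{places}f}".split(".")[1]
    for position, (x, y) in enumerate(zip(digits_a, digits_b), start=1):
        if x != y:
            return position
    return 0


if __name__ == "__main__":
    print(fdiv_residual())
    print(first_differing_decimal(NUMERATOR / DENOMINATOR,
                                  REPORTED_FLAWED_QUOTIENT))
```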

Page 17: Bugs from Outer Space | while42 SF #6

A story of Go, PTYs, LXC:

It never works the first time

# docker run -t -i ubuntu echo hello world

2013/08/06 23:20:53 Error: Error starting container 06d642aae1a:

fork/exec /usr/bin/lxc-start: operation not permitted

# docker run -t -i ubuntu echo hello world

hello world

# docker run -t -i ubuntu echo hello world

hello world

# docker run -t -i ubuntu echo hello world

hello world

# docker run -t -i ubuntu echo hello world

hello world

Page 18: Bugs from Outer Space | while42 SF #6
Page 19: Bugs from Outer Space | while42 SF #6

Strace to the rescue!

Steps:

1. Boot the machine.

2. Find pid of process to analyze.

(ps|grep, pidof docker...)

3. “strace -o log -f -p $PID”

4. “docker run -t -i ubuntu echo hello world”

5. Ctrl-C the strace process.

6. Repeat steps 3-4-5, using a different log file.

Note: can also strace directly, e.g. “strace ls”.

Page 20: Bugs from Outer Space | while42 SF #6

Let’s compare the log files

Thousands and thousands of lines.

Look for the error message.

(e.g. “operation not permitted”)

Other approach: start from the end, and try to

find the point when things started to diverge.

That’s why we have dual 30” monitors.
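
If you don't have dual 30” monitors, a rough sketch of the "find where things diverge" approach: normalize the numbers that differ between runs (PIDs, file descriptors), then report the first line where the two logs disagree. Replacing every number with `N` is a crude assumption made for this example; it can hide real differences, so treat the result as a starting point.

```python
import re
from itertools import zip_longest


def normalize(line: str) -> str:
    """Replace every run of digits with 'N' so PIDs and fds compare equal."""
    return re.sub(r"\d+", "N", line.strip())


def first_divergence(log_a, log_b):
    """Return (index, line_a, line_b) for the first differing line, or None."""
    for index, (a, b) in enumerate(zip_longest(log_a, log_b)):
        if a is None or b is None or normalize(a) != normalize(b):
            return index, a, b
    return None


if __name__ == "__main__":
    first_run = [
        "[pid 1331] setsid() = 1331",
        "[pid 1331] dup2(10, 0) = 0",
        "[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)",
    ]
    second_run = [
        "[pid 1414] setsid() = 1414",
        "[pid 1414] dup2(14, 0) = 0",
        "[pid 1414] ioctl(0, TIOCSCTTY) = 0",
    ]
    print(first_divergence(first_run, second_run))
```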

Page 21: Bugs from Outer Space | while42 SF #6

Investigation results

First time

[pid 1331] setsid() = 1331
[pid 1331] dup2(10, 0) = 0
[pid 1331] dup2(10, 1) = 1
[pid 1331] dup2(10, 2) = 2
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
[pid 1331] write(12, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1331] _exit(253) = ?

Second time (and every following attempt)

[pid 1414] setsid() = 1414
[pid 1414] dup2(14, 0) = 0
[pid 1414] dup2(14, 1) = 1
[pid 1414] dup2(14, 2) = 2
[pid 1414] ioctl(0, TIOCSCTTY) = 0
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>

Page 22: Bugs from Outer Space | while42 SF #6

What does that mean?

For some reason, some part of the code wants

file descriptor 0 (that’s stdin) to be a terminal.

The first time we run, it fails, but in the process,

we acquire a terminal.

(UNIX 101: when you don’t have a controlling terminal and open a file

which is a terminal, it becomes your controlling terminal, unless you

open the file with flag O_NOCTTY.)

Next attempts are therefore successful.
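
The "UNIX 101" rule can be observed with a small experiment. This is a sketch that assumes Linux behaviour and a usable pseudo-terminal device. A forked child becomes a session leader with `setsid()`, then opens a pty slave first with `O_NOCTTY` and then without it, checking each time whether `/dev/tty` can be opened. The function names are made up for this example, and it returns `None` if the experiment cannot run.

```python
import os


def has_controlling_tty() -> bool:
    """Opening /dev/tty only succeeds when there is a controlling terminal."""
    try:
        fd = os.open("/dev/tty", os.O_RDWR | os.O_NOCTTY)
    except OSError:
        return False
    os.close(fd)
    return True


def probe_controlling_tty():
    """Return (after_setsid, after_open_noctty, after_plain_open) or None."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the experiment in a brand new session.
        os.close(read_end)
        report = "E"
        try:
            os.setsid()  # new session: no controlling terminal any more
            after_setsid = has_controlling_tty()
            _master, slave = os.openpty()
            name = os.ttyname(slave)
            os.open(name, os.O_RDWR | os.O_NOCTTY)  # must not acquire it
            after_noctty = has_controlling_tty()
            os.open(name, os.O_RDWR)  # plain open: acquires it on Linux
            after_plain = has_controlling_tty()
            flags = (after_setsid, after_noctty, after_plain)
            report = "".join("1" if flag else "0" for flag in flags)
        except OSError:
            pass
        finally:
            os.write(write_end, report.encode())
            os._exit(0)
    # Parent: collect the child's report.
    os.close(write_end)
    report = os.read(read_end, 16).decode()
    os.close(read_end)
    os.waitpid(pid, 0)
    if len(report) != 3:
        return None  # no pty available, or not a system that behaves this way
    return tuple(char == "1" for char in report)


if __name__ == "__main__":
    print(probe_controlling_tty())
```

On Linux, the last flag flipping to true is the same side effect that made the second `docker run` succeed.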

Page 23: Bugs from Outer Space | while42 SF #6

… Really?

To confirm that this is indeed the bug:

● start the process with “setsid”

(which detaches from the controlling

terminal)

and see that the bug is back;

● check the output of “ps” (it shows controlling

terminals) and see that indeed, before the

first execution, we didn’t have a controlling

terminal, and we have one after!

23083 ? Sl+ 0:12 ./docker -d -b br0
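
The “?” in the `ps` output means "no controlling terminal". On Linux the same information is in the `tty_nr` field of `/proc/<pid>/stat`, where 0 means none. A sketch under that assumption; the sample line in the usage example is made up, not taken from the real docker process.

```python
def tty_nr_from_stat(stat_line: str) -> int:
    """Extract tty_nr (field 7) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')'.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm holds: state, ppid, pgrp, session, tty_nr, ...
    return int(after_comm[4])


def has_controlling_terminal(pid: int) -> bool:
    """Linux only: report whether the process has a controlling terminal."""
    with open(f"/proc/{pid}/stat") as stat_file:
        return tty_nr_from_stat(stat_file.read()) != 0


if __name__ == "__main__":
    sample = "23083 (docker) S 1 23083 23083 0 -1 4202752 0 0"  # hypothetical
    print(tty_nr_from_stat(sample))
```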

Page 24: Bugs from Outer Space | while42 SF #6

How to fix the bug?

¯\_(ツ)_/¯

I don’t know — yet!

(The bug was diagnosed last week,

and honestly, it’s not a showstopper.)

Page 25: Bugs from Outer Space | while42 SF #6

What did we learn?

strace is awesome to analyze behavior of

running processes.

ltrace can be used, too, if you want to

analyze library calls rather than system calls.

If you’re really desperate, gdb is your friend.

(A very peculiar friend, but a friend

nonetheless.)

Page 26: Bugs from Outer Space | while42 SF #6

“Errare humanum est,

perseverare autem

diabolicum”

“To err is human,

but to really foul things up,

you need a computer”

Page 27: Bugs from Outer Space | while42 SF #6

Really nasty (and sad)

bug:

The Therac-25

Radiotherapy machine (shoots beams to cure cancer)

Two modes: low energy and high energy.

In high energy mode, a special filter is inserted.

In other versions, a hardware system prevented

the high energy beam from shooting if the

filter was not in place.

On the Therac-25, it’s in software.

Page 28: Bugs from Outer Space | while42 SF #6

Konami Code of Death

On the keyboard, press (in less than 8

seconds)

X ↑ E [ENTER] B

...And the high energy beam shoots, unfiltered!

6 accidents, 3 died. (This was 1985-1987.)

Explanation: race condition in the software.

Never happened during tests since this was

an unusual sequence—and operators were

Page 29: Bugs from Outer Space | while42 SF #6

Aggravating details

Many engineering and institutional issues. (No

software review, no evaluation of possible failures,

undocumented error codes, no sensor feedback…)

After entering the sequence and sending one

beam, the machine would display an error.

But errors happened “all the time” (usually

without adverse effect) so the operator would

just proceed (equivalent of pressing “retry”).

Page 30: Bugs from Outer Space | while42 SF #6

Let’s get back to weird

Linux Kernel bugs

Page 31: Bugs from Outer Space | while42 SF #6

Random crashes on EC2

Pool of ~50 identical instances, with same role.

Sometimes, one of them would crash.

Total crash: no SSH, no ping, no log, no

nothing.

EC2 console won’t show anything.

REPRODUCE THE BUG?

IMPOSSIBURU!

Page 32: Bugs from Outer Space | while42 SF #6

Try a million things...

Different kernel versions

Different filesystem tunings

Different security settings (GRSEC)

Different memory settings (overcommit, OOM)

Different instance sizes

Different EBS volumes

Different differences

NOTHING CHANGED

Page 33: Bugs from Outer Space | while42 SF #6

And one fine day...

A random test machine seems to exhibit the

bug very frequently (it would crash in a few

days, sometimes just a few hours).

CLONE IT!

ONE MILLION TIMES!

Page 34: Bugs from Outer Space | while42 SF #6

But, still...

We changed everything (again),

but we couldn’t find anything (again).

So we did something completely crazy:

we contacted AWS support (imagine that).

They asked us to repeat the tests with an

“official” image (AMI). This required porting

our runtime from Ubuntu 10.04 to 12.04.

Page 35: Bugs from Outer Space | while42 SF #6

And… (I’m running out of segues)

We re-ran the tests with the official image,

the machine crashed, we left it in crashed

state,

support analyzed the image.

Almost instantly, they told us

“oh yeah it’s a known issue,

see that link.”

U SERIOUS?

Page 36: Bugs from Outer Space | while42 SF #6

The explanation

The bug happens:

● on workloads using spinlocks intensively;

● only on Xen VMs with many CPUs.

It is linked to the special implementation of

spinlocks in Xen VMs.

When waking up CPUs waiting on a spinlock,

the code would only wake up the 1st one,

even if there were multiple CPUs waiting.
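
The behaviour described above can be modelled in a few lines. This is a toy simulation, not kernel code: `cpus_kicked` and the `lock_spinners` dictionary are stand-ins invented for this example, and appending to a list stands in for waking a CPU.

```python
def cpus_kicked(lock_spinners: dict, lock: str, stop_after_first: bool) -> list:
    """Return the CPUs woken up when `lock` is released.

    lock_spinners maps a CPU number to the lock it is waiting on (or None).
    """
    kicked = []
    for cpu in sorted(lock_spinners):
        if lock_spinners[cpu] == lock:
            kicked.append(cpu)  # stands in for sending the wake-up to this CPU
            if stop_after_first:
                break  # buggy behaviour: other waiters are never woken
    return kicked


if __name__ == "__main__":
    spinners = {0: "A", 1: None, 2: "A", 3: "B"}
    print(cpus_kicked(spinners, "A", stop_after_first=True))
    print(cpus_kicked(spinners, "A", stop_after_first=False))
```

With `stop_after_first=True`, CPU 2 keeps waiting on lock A even though the lock was released, which is how the whole machine ends up stuck.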

Page 37: Bugs from Outer Space | while42 SF #6

The patch (priceless)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..67bc7ba 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
 		if (per_cpu(lock_spinners, cpu) == xl) {
 			ADD_STATS(released_slow_kicked, 1);
 			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-			break;
 		}
 	}
 }
--

Page 38: Bugs from Outer Space | while42 SF #6

What did we learn?

We didn’t try all the combinations.

(Trying on HVM machines would have

helped!)

AWS support can be helpful sometimes.

(This one was a surprise.)

Trying to debug a kernel issue without console

output is like trying to learn to read in the

dark.

(Compare to local VM with serial output…)

Page 39: Bugs from Outer Space | while42 SF #6
Page 40: Bugs from Outer Space | while42 SF #6

Overall Conclusions

When facing a mystic bug from outer space:

● reproduce it at all costs!

● collect data with tcpdump, ngrep, wireshark,

strace, ltrace, gdb; and log files, obviously!

● don’t be afraid of uncharted places!

● document it, at least with a 2 AM ragetweet!

Page 41: Bugs from Outer Space | while42 SF #6

Thank you! Questions?

Gotta follow them all:

@kwarter

@while_42

@GITSF

@dot_cloud

@docker

Your speaker today was:

Jérôme Petazzoni, dotCloud

@jpetazzo