Servers and Processes Behavior and Analysis
Page 1: Servers and Processes: Behavior and Analysis

Servers and Processes: Behavior and Analysis

Page 2: Servers and Processes: Behavior and Analysis

The Next 90 Minutes

Introduction

Servers, a mental model

Getting hands on

Processes

Wrapping it up

Page 3: Servers and Processes: Behavior and Analysis

Caveats

Tutorial aimed at people barely familiar with Linux consoles

Little server knowledge is assumed

Many advanced things are glossed over

...but feel free to ask!

The slides will be available online

Page 4: Servers and Processes: Behavior and Analysis

Your Presenter

Mark Smith <[email protected]>

Co-founded Dreamwidth Studios, but works at Bump Technologies (http://bu.mp/)

Spent time at Google, Mozilla, others

Sysadmin, MySQL DBA, engineer, ...

Page 5: Servers and Processes: Behavior and Analysis

Servers

Page 6: Servers and Processes: Behavior and Analysis

Servers

Machines that take input and make output

Made up of components: RAM, CPU, I/O

Each component has various capacities

Systems Administration: the understanding, care, and feeding of all these disparate components (among other things)

Page 7: Servers and Processes: Behavior and Analysis

Components

Capacity

Latency

Throughput

Full state

Thrash state

Page 8: Servers and Processes: Behavior and Analysis

RAM

Capacity measured in bytes (GB usually)

Latency measured in nanoseconds

Throughput measured in bytes/second

Full state: can’t add more, but no real loss of performance

Thrash state: not very relevant

Page 9: Servers and Processes: Behavior and Analysis

Disk (Rotational)

Capacity measured in bytes (GB or TB)

Latency measured in milliseconds

Throughput measured in bytes/second

Full state: can’t add more, but otherwise fine

Thrash state: server and process starvation, performance drops drastically

Page 10: Servers and Processes: Behavior and Analysis

Disk (SSD)

Capacity measured in bytes (GB or TB)

Latency measured in microseconds (around 100µs, roughly 60x faster than rotational disks)

Throughput measured in bytes/second

Full state: can’t add more, but otherwise fine

Thrash state: obviated by lack of rotation

Page 11: Servers and Processes: Behavior and Analysis

CPU

Capacity measured in clock cycles per second, also known as hertz (MHz, GHz, etc.)

Throughput and latency of a CPU are very advanced things most sysadmins don’t need to worry about (e.g., optimizing for L1 cache and local RAM in NUMA systems)

Full/thrash state: system/process starvation

Page 12: Servers and Processes: Behavior and Analysis

Network

Capacity not relevant

Latency measured in milliseconds (usually)

Throughput measured in bits/second and usually 1 Gbps (10 Gbps becoming common)

Full state: dropped packets, behavior depends on protocol (e.g., TCP or UDP)

Thrash state: not relevant

Page 13: Servers and Processes: Behavior and Analysis

Timing Comparisons

1 second - tick, tock, tick, tock, ...

1,000 milliseconds (ms) per second

1,000,000 microseconds (µs) per second

1,000,000,000 nanoseconds (ns) per second

Page 14: Servers and Processes: Behavior and Analysis

Timing (Part 2)

One seek on a rotational disk is ~6ms

SSD seeks are about 100µs: 60x faster than a rotational seek

RAM seeks are about 60ns: roughly 1,667x faster than an SSD seek (100,000x faster than a rotational seek!)
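
The ratios on this slide follow directly from the three latencies. A minimal sketch of the arithmetic, using the slide's round numbers (rough figures, not measurements); the constant and function names are made up for illustration:

```python
# Rough seek latencies from the slide, all expressed in nanoseconds.
NS_PER_US = 1_000
NS_PER_MS = 1_000_000

ROTATIONAL_SEEK_NS = 6 * NS_PER_MS   # ~6 ms
SSD_SEEK_NS = 100 * NS_PER_US        # ~100 µs
RAM_SEEK_NS = 60                     # ~60 ns


def speedup(slow_ns: float, fast_ns: float) -> float:
    """Return how many times faster the second latency is than the first."""
    return slow_ns / fast_ns


if __name__ == "__main__":
    print(f"SSD vs rotational: {speedup(ROTATIONAL_SEEK_NS, SSD_SEEK_NS):,.0f}x")
    print(f"RAM vs SSD:        {speedup(SSD_SEEK_NS, RAM_SEEK_NS):,.0f}x")
    print(f"RAM vs rotational: {speedup(ROTATIONAL_SEEK_NS, RAM_SEEK_NS):,.0f}x")
```

Converting everything to one unit first is the whole trick; mixing ms, µs, and ns is where mental arithmetic usually goes wrong.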

Page 15: Servers and Processes: Behavior and Analysis

Hands On Time!

Page 16: Servers and Processes: Behavior and Analysis

SSH to the VM

Open your local terminal (PuTTY in Windows, iTerm/Terminal/etc in Mac OS X, whatever you like in Linux)

ssh -p 2222 [email protected]

Password is “demo”

Please be nice :)

Page 17: Servers and Processes: Behavior and Analysis

It’s dark in here.

Heartbeat the machine

uptime - How’s it doing?

free -m - How’s the RAM?

df -h - How’re the disks?
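
If you would rather script the heartbeat than type it, the same three questions can be asked from Python. This is only a sketch, assuming a Linux machine with /proc mounted; the function names `parse_meminfo` and `heartbeat` are invented here and are not part of any tool from the slides:

```python
import os
import shutil


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values in that file are in kB; a few are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def heartbeat(meminfo_path: str = "/proc/meminfo", mount: str = "/") -> dict:
    """Collect load average, memory, and disk figures in one dictionary."""
    with open(meminfo_path) as fh:
        mem = parse_meminfo(fh.read())
    disk = shutil.disk_usage(mount)
    return {
        "load": os.getloadavg(),                        # like uptime
        "mem_total_mb": mem.get("MemTotal", 0) // 1024,  # like free -m
        "mem_free_mb": mem.get("MemFree", 0) // 1024,
        "disk_used_pct": round(100 * disk.used / disk.total, 1),  # like df
    }
```

Calling `heartbeat()` on the VM returns one snapshot; running it before and after each exhibit makes the changes easy to compare.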

Page 18: Servers and Processes: Behavior and Analysis

Load Average

It’s a seat-of-the-pants number

Rule of thumb: low is good, high might be bad

You have to learn how your machines work for this number to mean much
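
One way to make the number a little less seat-of-the-pants is to compare it with the number of CPUs. That is a common rule of thumb and an assumption of this sketch, not something the slides prescribe; the helper names are invented for illustration:

```python
def parse_loadavg(text: str) -> tuple:
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_per_cpu(load: float, cpus: int) -> float:
    """Divide a load average by the CPU count (treating 0 CPUs as 1)."""
    return load / max(cpus, 1)
```

On the VM you might feed it `open("/proc/loadavg").read()` and `os.cpu_count()`. A per-CPU value well above 1 suggests more work is queued than the CPUs can run, but on Linux the load average also counts tasks waiting on disk, so it still has to be read alongside the other tools.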

Page 19: Servers and Processes: Behavior and Analysis

Top of the World

Easy way to see what’s running and what is consuming the most resources

top

Press “P” to sort by Processor usage

Press “M” to sort by Memory usage

Page 20: Servers and Processes: Behavior and Analysis

Exhibit #1

Now I will do something on the machine

Run through your heartbeat steps again: uptime, free -m, df -h, top

Remember to sort top by P and M

What has changed? What is going on?

Page 21: Servers and Processes: Behavior and Analysis

Results #1

You probably noticed 1-cpu.pl

It’s pushing the CPU to 100%

Is it broken? Is this bad?

Know your software and systems (very important to know what normal is)

Page 22: Servers and Processes: Behavior and Analysis

Exhibit #2

Now I will do something else

Run through your heartbeat steps again: uptime, free -m, df -h, top

Remember to sort top by P and M

What has changed? What is going on?

Page 23: Servers and Processes: Behavior and Analysis

Results #2

Lots of memory is being consumed

It’s some 2-memory.pl command

Does the machine feel sluggish? Does each command take a second to start and stop?

What is going on here?

Page 24: Servers and Processes: Behavior and Analysis

vmstat

The vmstat tool tells us useful things about the state of the kernel and resource usage

Try: vmstat -SM 1

Watch while I run the test again

Note the si/so (swap in/out) and bi/bo (blocks in/out) columns

Now notice the CPU columns on the right
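
The si/so columns come from cumulative kernel counters, so "is it swapping right now?" is really a question about how fast those counters are growing. A minimal sketch, assuming the `pswpin` and `pswpout` counters in /proc/vmstat (counted in pages, not MB); the function names are invented for illustration:

```python
def parse_vmstat(text: str) -> dict:
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_activity(before: dict, after: dict) -> dict:
    """Pages swapped in and out between two snapshots of the counters."""
    return {
        "swapped_in": after.get("pswpin", 0) - before.get("pswpin", 0),
        "swapped_out": after.get("pswpout", 0) - before.get("pswpout", 0),
    }
```

Take one snapshot of /proc/vmstat, wait a second, take another, and pass both to `swap_activity`. Non-zero deltas while the test runs are the same signal as non-zero si/so in vmstat.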

Page 25: Servers and Processes: Behavior and Analysis

Swap

RAM is a finite resource

Not all RAM is used equally

Kernel tracks usage of pages

Kernel can write RAM to disk and free it up

This is called swapping: you store RAM on disk. Remember the timing slide!

Page 26: Servers and Processes: Behavior and Analysis

Swap (Part 2)

Swap is useful mostly on consumer machines

In most server environments, swap is death

Disks are hundreds to thousands of times (or more!) slower than RAM

Generally, any active swapping is bad

Page 27: Servers and Processes: Behavior and Analysis

Exhibit #3

Try uptime, free -m, df -h, top again

Also, try: iostat -kx 1

Watch the %util column as this test runs

Also the bi/bo columns in vmstat

What is going on here?

Page 28: Servers and Processes: Behavior and Analysis

Results #3

Disk usage is high

RAM is not full

CPU is not pegged

Machine responds well

Disk utilization at 100%
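
The %util column is essentially "what fraction of the last interval did the disk have work in flight". A rough sketch of that calculation, assuming the "time spent doing I/Os" counter in /proc/diskstats (the 13th whitespace-separated field, in milliseconds); the device name `sda` in the usage note and the function names are assumptions for illustration:

```python
def io_ticks_ms(diskstats_text: str, device: str) -> int:
    """Return the cumulative time (ms) the device has spent doing I/O."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the per-device counters
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise KeyError(device)


def utilization_pct(ticks_before: int, ticks_after: int, interval_s: float) -> float:
    """Busy time as a percentage of the sampling interval, capped at 100."""
    busy_ms = ticks_after - ticks_before
    return min(100.0, 100.0 * busy_ms / (interval_s * 1000.0))
```

Read /proc/diskstats twice, one second apart, and pass both readings for a device such as `sda` through these helpers. A result near 100 means the disk was busy for the whole interval, which is what the test on this slide is doing to it.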

Page 29: Servers and Processes: Behavior and Analysis

What does it mean?

Based on the various data you’ve gathered, is the machine healthy and happy with this program running on it?

Why or why not?

Discussion.

Page 30: Servers and Processes: Behavior and Analysis

Solutions?

This program is using more RAM or CPU than the machine has available

Program can be optimized to use less

Machine can be upgraded to have more

Simple problem, straightforward solutions

(Straightforward does not always mean easy)

Page 31: Servers and Processes: Behavior and Analysis

Programs

Page 32: Servers and Processes: Behavior and Analysis

Programs

Software that runs on a machine

Has traits such as single- or multi-threaded, compiled or interpreted, etc

Requires certain resources and inputs

Makes certain outputs

Page 33: Servers and Processes: Behavior and Analysis

More Constraints

Programs have more constraints to consider

Open files and sockets (file descriptors)

Permissions (depend on user/group)

CPU limits (depend on threads)

Page 34: Servers and Processes: Behavior and Analysis

Exhibit #4

There’s a program running now, but something is wrong with it

Use the usual tools (uptime, free -m, df -h, top)

System looks OK...

Page 35: Servers and Processes: Behavior and Analysis

File Limits

Programs have certain limits

Get the PID of the 4-files.pl program

ps aufx | grep 4-files

cat /proc/PID/limits
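
The line to look for in that output is "Max open files". A small sketch that pulls the soft and hard values out of the text; the function name is invented for illustration, and the values are returned as strings because a limit can also read "unlimited":

```python
def max_open_files(limits_text: str) -> tuple:
    """Return (soft, hard) from the 'Max open files' line of /proc/PID/limits."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max", "open", "files", soft limit, hard limit, units
            soft, hard = line.split()[3:5]
            return soft, hard
    raise ValueError("no 'Max open files' line found")
```

For example, `max_open_files(open("/proc/self/limits").read())` reports the limits of the Python process itself; substitute the PID you found with ps to inspect another process.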

Page 36: Servers and Processes: Behavior and Analysis

lsof

See what files a program has open

lsof -np PID

Whoa, lots! At the limit? Count them:

lsof -np PID | wc -l
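
Note that the lsof count includes a header line and entries that are not file descriptors (such as the working directory and mapped libraries), so it runs a little high. An alternative sketch counts the entries in /proc/PID/fd directly, assuming a Linux /proc; the function names and the 90% threshold are assumptions for illustration:

```python
import os


def open_fd_count(pid="self", proc_root: str = "/proc") -> int:
    """Count entries in /proc/PID/fd; each entry is one open file descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def near_limit(open_count: int, soft_limit: int, threshold: float = 0.9) -> bool:
    """True when the process is using at least `threshold` of its soft limit."""
    return open_count >= threshold * soft_limit
```

Looking inside another user's /proc/PID/fd generally requires root, the same as with lsof.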

Page 37: Servers and Processes: Behavior and Analysis

But... a problem?

But is this a problem? Well, it is if the program is trying to open more files

How do we tell?

Software calls open, which is a system call

Page 38: Servers and Processes: Behavior and Analysis

System Calls

The kernel provides certain services

Almost all I/O goes through the kernel

Current time, fork, cd (chdir), exec, etc.

Requires a small context switch

Can lead to “sys” CPU usage

Page 39: Servers and Processes: Behavior and Analysis

strace

System calls made by a process can be traced

Let’s look at 4-files again:

sudo strace -p PID

Look at the “open” line: is it OK?

Page 40: Servers and Processes: Behavior and Analysis

Results #4

Clearly this program is broken

Several fixes... open fewer files, raise your limits, etc

(We won’t cover the specifics of raising limits, you can search Google if you need it)
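
To see this failure mode safely, a process can lower its own limit and then keep opening files until open fails. The sketch below is a stand-in for the idea, not the 4-files.pl program from the demo (which is not shown in the slides); it assumes a Unix system where Python's `resource` module is available, and the function name and the limit of 64 are arbitrary choices:

```python
import errno
import os
import resource


def open_until_failure(soft_limit: int = 64):
    """Lower our own open-file limit, then open files until open() fails.

    Returns (files_opened, errno_name). The original limit is restored.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    failure = None
    try:
        # One more attempt than the limit allows guarantees a failure.
        for _ in range(soft_limit + 1):
            try:
                fds.append(os.open(os.devnull, os.O_RDONLY))
            except OSError as exc:
                failure = errno.errorcode.get(exc.errno, str(exc.errno))
                break
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(fds), failure
```

The error name it returns is the same kind of thing strace prints next to a failing open call, which is why the “open” line is the one to look at.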

Page 41: Servers and Processes: Behavior and Analysis

It’s all turtles.

Linux uses “files” and “filesystems” a lot

Sockets are just “files”, they use the same file descriptor number space

Result: “Max open files” includes sockets

They show up in lsof, too!
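
A quick way to convince yourself that files and sockets draw from the same pool of numbers is to open one of each and look at the descriptors. A minimal sketch (the function name is invented for illustration):

```python
import os
import socket


def file_and_socket_fds():
    """Open one regular file and one TCP socket; return both descriptor numbers."""
    fd = os.open(os.devnull, os.O_RDONLY)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        return fd, sock.fileno()
    finally:
        sock.close()
        os.close(fd)
```

Both values are small integers from the same per-process table, which is why a busy network server can run out of "open files" without touching the disk.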

Page 42: Servers and Processes: Behavior and Analysis

Exhibit #5

Let me give us a new program

Get the PID, remember how?

ps aufx | grep 5-network

Look at the files: lsof -np PID

Note the “TCP” file!

Page 43: Servers and Processes: Behavior and Analysis

Test the Server

telnet 182.255.123.52 7000

(This server is slow, it might take a bit)

A very simple timeserver

Now: strace -p PID

Page 44: Servers and Processes: Behavior and Analysis

The Trace

accept(3, {sa_family=AF_INET, sin_port=htons(39474), sin_addr=inet_addr("127.0.0.1")}, [16]) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff73f27608) = -1 ENOTTY ...
lseek(4, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "The time is: Wed Feb 6 13:34:22"..., 38) = 38
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7fff73f28880) = 0
write(4, "Thank you for visiting!\n", 24) = 24
close(4) = 0
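
Reading the trace from top to bottom gives away the shape of the program: accept a connection, write the time, sleep for a second, write a goodbye, close. The sketch below is a hypothetical Python reconstruction of that shape, not the actual 5-network program (whose source is not shown); all function names and the `delay_s` parameter are assumptions for illustration:

```python
import socket
import time


def make_listener(host: str = "127.0.0.1", port: int = 0) -> socket.socket:
    """Create a listening TCP socket (port 0 lets the OS pick a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(5)
    return listener


def serve_once(listener: socket.socket, delay_s: float = 1.0) -> None:
    """Handle one client with the same sequence of steps seen in the trace."""
    conn, _addr = listener.accept()               # the accept(...) call
    try:
        conn.sendall(f"The time is: {time.ctime()}\n".encode())  # first write
        time.sleep(delay_s)                       # the nanosleep({1, 0}, ...)
        conn.sendall(b"Thank you for visiting!\n")  # second write
    finally:
        conn.close()                              # the close(...) call


def fetch(host: str, port: int) -> bytes:
    """Connect like telnet would and read until the server closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as client:
        while True:
            data = client.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `serve_once` in a loop on port 7000 would behave much like the demo server from the client's point of view. Because it handles one client at a time and sleeps while doing so, every waiting client is delayed, which is one plausible reason a server like this feels slow.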

Page 53: Servers and Processes: Behavior and Analysis

Results #5

Tracing shows you data, too

Can be very valuable for finding moving parts that aren’t moving well

Combined with the other tools you can really see what is going on in your system

Page 54: Servers and Processes: Behavior and Analysis

Kernel

Page 55: Servers and Processes: Behavior and Analysis

Invisible Glue

Kernel issues are fairly rare, but usually frustrating if they show up

Usually the result of some sort of limit hit

Tons of caches, buckets, and limits

Be suspicious of “powers of two” numbers

Page 56: Servers and Processes: Behavior and Analysis

Common Checks

Try: sudo dmesg

Kernel message log shows many problems

Look for suspicious messages

Page 57: Servers and Processes: Behavior and Analysis

“Suspicious”

Out of memory: Kill process 19393 (2-memory.pl) score 90 or sacrifice child

nf_conntrack: Table full, dropping packet

ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
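
Once you know what suspicious looks like, scanning for it can be automated. A minimal sketch that searches kernel log text for the three kinds of message on this slide; the labels and function name are invented for illustration, and the exact wording of these messages varies between kernel versions, so treat the patterns as a starting point:

```python
import re

# Patterns modeled on the three example messages above.
SUSPICIOUS_PATTERNS = {
    "oom_kill": re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)"),
    "conntrack_full": re.compile(r"nf_conntrack: [Tt]able full, dropping packet"),
    "ata_exception": re.compile(r"ata[\d.]+: exception Emask"),
}


def scan_kernel_log(text: str) -> list:
    """Return (label, line) pairs for every line matching a known pattern."""
    hits = []
    for line in text.splitlines():
        for label, pattern in SUSPICIOUS_PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
                break
    return hits
```

Feed it the output of dmesg. Each label points at a different component: memory exhaustion, a full connection-tracking table, or a disk controller error.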

Page 58: Servers and Processes: Behavior and Analysis

More Places to Look

The /var/log directory has much data

Generally in a problem state, look for recently updated files: ls -lart

Loud logs are often unhappy logs

Hardware failure is often noted in one of the log files
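
The ls -lart trick sorts by modification time with the newest files at the bottom, right above your prompt. The same ordering in Python, as a small sketch (the function name and the default of 10 entries are assumptions for illustration):

```python
import os


def recently_modified(directory: str = "/var/log", limit: int = 10) -> list:
    """Return up to `limit` file names, most recently modified last."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if entry.is_file(follow_symlinks=False):
                mtime = entry.stat(follow_symlinks=False).st_mtime
                entries.append((mtime, entry.name))
    entries.sort()  # oldest first, newest last, like ls -lart
    return [name for _mtime, name in entries[-limit:]]
```

The files at the end of the list are the ones being written to right now, which is usually where to start reading.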

Page 59: Servers and Processes: Behavior and Analysis

Summary

Page 60: Servers and Processes: Behavior and Analysis

Process

Check the components: CPU, RAM, disks

Find what limits are being hit and by what

If the system is fine, it’s probably software

Trace the program, check the logs

Analyze well before you fix

Page 61: Servers and Processes: Behavior and Analysis

Familiarity!

Systems administration done only as an afterthought will be painful and hard

Be familiar with your servers and your software

Keep a shell open, watch top throughout the day, watch the disks, etc

Page 62: Servers and Processes: Behavior and Analysis

Next Steps

Certain tools make life easier

Nagios for monitoring (e.g., alert you when CPU exceeds 90%)

Cacti/Ganglia/OpenTSDB for trending

Fabric for multiple machine operations

Puppet/Chef for configuration management

Page 63: Servers and Processes: Behavior and Analysis

Thanks!

Questions?