Transcript
Page 1: Linux Mpio

National Center for Supercomputing Applications, LCI Conference 2007

SAN Persistent Binding and Multipathing in the 2.6 Kernel

Michelle Butler, Technical Program Manager
Andy Loftus, System Engineer
Storage Enabling Technologies

[email protected] or [email protected]

Slides available at http://dims.ncsa.uiuc.edu/set/san/

Page 2: Linux Mpio


Who?
• NCSA
  – a unit of the University of Illinois at Urbana-Champaign
  – a federal, state, university, and industry funded center
• Academic Users
  – NSF peer review
• Large amount of applications/user needs
  – 3rd party codes, user written…
  – All running on same environment
• Many research areas

Page 3: Linux Mpio


NCSA’s 1st Dell Cluster
• Tungsten: 1750 server cluster
  – 3.2 GHz Xeon
  • 2,560 processors (compute only)
  • 16.4 TF; 3.8 TB RAM; 122 TB disk
  • Dell OpenManage
– Myrinet
  • Full bi-section
– Lustre over Gig-E
  • 13 DataDirect 8500
  • 104 OSTs, 2 MDS w/ separate disk
  • 11.1 GB/sec sustained
– Power/Cooling
  • 593 KW / 193 tons
– Production date: April 2004
– User Environment
  • Platform Computing LSF
  • Softenv
  • Intel Compilers
  • ChaMPIon Pro, MPICH, VMI-2

The first large-scale Dell cluster!!!

Page 4: Linux Mpio


NCSA’s 3rd Dell Cluster
• T2 – retired into:
• Tungsten-3: 1955 blade cluster
  – 2.6 GHz Woodcrest Dual Core
  • 1,040 processors / 2,080 cores
  • 22 TF; 4.1 TB RAM; 20 TB disk
  • Warewulf
– Cisco InfiniBand
  • 3 to 1 over-subscribed
  • OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
  • 4 FasT controllers direct FC
  • 1.2 GB/s sustained
  • 8 OSTs and 2 MDS w/ complete auto failovers
– Power/Cooling
  • 148 KW / 42 tons
– Production date: March 2007
– User Environment
  • Torque/Moab
  • Softenv
  • Intel Compilers
  • VMI-2

Page 5: Linux Mpio


NCSA’s 4th Dell Cluster
• Abe: 1955 blade cluster
  – 2.33 GHz Cloverton Quad-Core
  • 1,200 blades / 9,600 cores
  • 89.5 TF; 9.6 TB RAM; 120 TB disk
  • Perceus management; diskless boot
– Cisco InfiniBand
  • 2 to 1 oversubscribed
  • OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
  • 22 OSTs
  • 2 9500 DDN controllers direct FC
  • 10 FasT controllers on SAN fabric
  • 8.4 GB/s sustained
  • 22 OSTs and 2 MDS w/ complete auto failovers
– Power/Cooling
  • 500 KW / 140 tons
– Production date: May 2007 (anticipated)
– User Environment
  • Torque/Moab
  • Softenv
  • Intel Compiler
  • MPI: evaluating Intel MPI, MPICH, MVAPICH, VMI-2, etc.

The largest Dell cluster!!!

Page 6: Linux Mpio


NCSA Facility - ACB
• Advanced Computation Building
  – Three rooms, totals:
    • 16,400 sq ft raised floor
    • 4.5 MW power capacity
    • 250 kW UPS
    • 1,500 tons cooling capacity
  – Room 200:
    • 7,000 sq ft – no columns
    • 70” raised floor
    • 2.3 MW power capacity
    • 750 tons cooling capacity

Page 7: Linux Mpio


NCSA’s Other Systems
• Distributed Memory Clusters
  – Mercury (IBM, 1.3/1.5 GHz Itanium2):
    • 1,846 processors
    • 10 TF; 4.6 TB RAM; 90 TB disk
• Shared Memory Clusters
  – Copper (IBM p690, 1.3 GHz Power4): 12 x 32 processors
    • 2 TF; 64 or 256 GB RAM each; 35 TB disk
  – Cobalt (SGI Altix, 1.5 GHz Itanium2): 2 x 512 processors
    • 6.6 TF; 1 TB or 3 TB RAM; 250 TB disk

Page 8: Linux Mpio


NCSA Storage Systems
• Archival: SGI/Unitree (5 PB total capacity)
  – 72 TB disk cache; 50 tape drives
  – currently 2.8 PB of data in MSS
    • >1 PB ingested in last 6 months
    • project ~3.2 PB by end of CY2006
    • licensed to support 5 PB resident data
  – ~30 data collections hosted
• Infrastructure: 394 TB Fibre Channel SAN connected
  – FC and SATA environments
  – Lustre, IBRIX, NFS filesystems
• Databases:
  – 8 processor, 12 GB memory SGI Altix
    • 30 TB of SAN storage
    • Oracle 10g, MySQL, Postgres
  – Oracle RAC cluster
  – Single-system Oracle deployments for focused projects

Page 9: Linux Mpio


Visualization Resources
• 30M-pixel Tiled Display Wall
  – 8192 x 3840 pixel composite display
  – 40 NEC VT540 projectors, arranged in a 5H x 8W matrix
  – driven by a 40-node Linux cluster
    • dual-processor 2.4 GHz Intel Xeons with NVIDIA FX 5800 Ultra graphics accelerator cards
    • Myrinet interconnect
    • to be upgraded by early CY2007
  – funded by State of Illinois
• SGI Prisms
  – 8 x 8 processor (1.6 GHz Itanium2)
  – 4 graphics pipes each; 1 GB RAM each
  – InfiniBand connection to Altix machines

Page 10: Linux Mpio


SAN at NCSA

• 1.3 PB spinning disk
  – 895 TB SAN attached
• 1,392 Brocade switch ports
• 7 SAN fabrics
• 2 data centers

Page 11: Linux Mpio


Persistent Binding

• Device naming problems
• Udev solution
• Examples
• Interactive Demo

Page 12: Linux Mpio


Device Naming Problem (Before / After)

• Add hardware
• SAN zoning
• New SAN luns
• Modify config

Device node mapping can change with changes to
- hardware
- software
- SAN

Devices are assigned random names (based on the next available major/minor pair for the device type)

CLUSTER
- Multiple hosts that see the same disk will assign the disk to different device nodes
  - may be /dev/sda on system1 but /dev/sdc on system2
- Can change with hardware changes; what used to be /dev/sda is now /dev/sdc

Devfs helps only a little:
- Fixes device naming; on a single host, a disk will always have the same device node
- But different hosts may have different device names for the same physical disk

Page 13: Linux Mpio


What needs to happen

• Storage target always maps to the same local device (i.e. /dev/…)
• Local device name should be meaningful
  – /dev/sda conveys no information about the storage device

Page 14: Linux Mpio


udev - Persistent Device Naming

• “Udev is … a userspace solution for a dynamic /dev directory, with persistent device naming” *
  – Userspace: not required to remain in memory
  – Dynamic: /dev not filled with unused files
  – Persistent: devices always accessible using the same device node
• Provides for custom device names

* Daniel Drake (http://www.reactivated.net/writing_udev_rules.html)

Devfs provides dynamic and persistent naming, but:
- kernel based: the entire device db is stored in kernel memory, never swapped
- not possible to customize device names

UDEV CUSTOM
- custom names for devices
- custom scripts can be run when specific devices are attached/removed

Page 15: Linux Mpio


Setting up udev device mapper

Overview

1. Uniquely identify each lun
2. Assign a meaningful name to each lun

Page 16: Linux Mpio


1. Uniquely identify each lun

/sbin/scsi_id

Sample usage:
root# scsi_id -g -u -s /block/sda
SSEAGATE_ST318406LC_____3FE27FZP000073302G5W

root# scsi_id -g -u -s /block/sdb
3600a0b8000122c6d00000000453174fc

Data flow: scsi_id takes a device name, issues a SCSI INQUIRY, and returns a unique id.

/sbin/scsi_id

- INPUT: existing local device name

- OUTPUT: string that uniquely identifies the specific device (guaranteed unique among all scsi devices)

SAMPLE:

- sda: locally installed drive

- sdb: SAN attached disk

Page 17: Linux Mpio


2. Associate a meaningful name

• BUS=scsi
  – /sys/bus/scsi
• SYSFS
  – <BUS>/devices/H:B:T:L/<filename>
• PROGRAM & RESULT
  – Program to invoke and result to look for
• NAME
  – Device name to create (relative to /dev)

New udev rules file: /etc/udev/rules.d/20-local.rules
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000", PROGRAM="/sbin/scsi_id -g -u -s /block/%k", RESULT="360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"

Custom naming controlled by rulesets stored in /etc/udev/rules.d

A rule is a list of keys to match against.

When all keys match, the specified action is taken (create a device name or symlink)
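Not from the slides, just a hedged sketch: one way to bulk-generate rule lines like the one above is to loop over the SCSI block devices and capture each scsi_id. The NAME values produced below are placeholders that would be edited into something meaningful (e.g. disk/fc/sdd4c1l0).

root# for dev in /sys/block/sd*; do
>   d=$(basename $dev)
>   id=$(/sbin/scsi_id -g -u -s /block/$d)
>   echo "BUS=\"scsi\", PROGRAM=\"/sbin/scsi_id -g -u -s /block/%k\", RESULT=\"$id\", NAME=\"disk/fc/$d\""
> done >> /etc/udev/rules.d/20-local.rules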

Page 18: Linux Mpio


Example: Customizing for multiple paths

ProblemMultiple paths to a

single lun results inmultiple devicenodes.

Need to know whichpath each deviceuses.

Page 19: Linux Mpio


Example: Customizing for multiple paths

Custom script: mpio_scsi_id

Sample udev rule:
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000", PROGRAM="/root/bin/mpio_scsi_id %k", RESULT="23000001ff03092f360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"

Data flow: udev passes the device name to mpio_scsi_id, which runs scsi_id and looks up the disk controller WWPN, then returns WWPN + scsi_id.

Get disk controller WWPN:
(Emulex) /sys/class/fc_transport/target<H>:<B>:<T>/port_name
(QLA) grep + awk to pull the value from /proc/scsi/ql2xxx/<host_id>
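A hedged sketch of what mpio_scsi_id does, per the data flow above; the real script is part of the FC tools at http://dims.ncsa.uiuc.edu/set/san. This version assumes the Emulex sysfs path shown above and omits the QLogic /proc parsing.

#!/bin/sh
# mpio_scsi_id (sketch): print <controller WWPN><scsi_id> for one device.
# $1 is the kernel device name passed by udev via %k (e.g. sdd).
dev=$1
hbtl=$(basename $(readlink /sys/block/$dev/device))   # H:B:T:L, e.g. 2:0:1:0
target=${hbtl%:*}                                     # H:B:T,   e.g. 2:0:1
wwpn=$(cat /sys/class/fc_transport/target$target/port_name 2>/dev/null)
id=$(/sbin/scsi_id -g -u -s /block/$dev)
echo "${wwpn#0x}${id}"                                # strip leading 0x from the WWPN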

Page 20: Linux Mpio


Demo: udev persistent device naming

• Single HBA
• Single disk unit
  – 4 luns
  – Each lun presented through both controllers
• Host sees 8 logical luns
• Use mpio_scsi_id to identify the ctlr-lun

Page 21: Linux Mpio


Demo: udev persistent device naming

Original Configuration
• udev config file
  – /etc/udev/udev.conf
• scsi_id config file
  – /etc/scsi_id.config
• Scan fc luns
  – {sysfs}/hostX/scan
  – /dev/disk/by-id

Custom device names
• Custom rules file
  – 20-local.rules
• Restart udev
  – udevstart
• Custom device names created
  – /dev/disk/fc

BEGIN

- tail -f /var/log/messages

1. Enable udev logging

2. Enable scsi_id for all devices (options -g)

3. /proc/partitions

4. Scan fc luns (echo “- - -” > /sys/class/scsi_host/hostX/scan)

5. See udev log lines in messages file ; See fc disks in /dev/disk/by-id

6. Enable 20-local rules file

7. Udevstart

8. See udev log lines in messages file ; See fc disks in /dev/disk/fc

DEFAULT CONFIGURATION

Local rules file already exists. Disable it.

Default behavior for scsi_id is to blacklist everything unknown (-b option). Enable whitelisting of everything (-g option) so scsi_ids will be returned.

Even before custom rules are in place, see default udev rule selection activity in /var/log/messages

After running delete_fc_luns, udev removes the /dev/sdX device files (/var/log/messages)

CUSTOM CONFIGURATION

Udev custom rules are selected (see /var/log/messages)

Major/minor numbers line up between /dev/disk/fc/* and /proc/partitions
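A condensed shell sketch of the demo steps above (not a capture of the actual demo; the host number and the way the rules file is enabled are illustrative):

root# tail -f /var/log/messages &                      # 1. watch udev activity
root# cat /proc/partitions                             # 3. devices before the scan
root# echo "- - -" > /sys/class/scsi_host/host2/scan   # 4. scan fc luns
root# ls /dev/disk/by-id                               # 5. default persistent names
(enable /etc/udev/rules.d/20-local.rules)              # 6.
root# udevstart                                        # 7. re-run udev over all devices
root# ls /dev/disk/fc                                  # 8. custom names created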

Page 22: Linux Mpio


Demo: udev persistent device naming

Debugging
• Not all sysfs files are available immediately
  – HBA target WWPN
  – Add udevstart to boot scripts
• Udev tools can help
  – udevinfo
  – udevtest

Examples
• udevinfo -a -p $(udevinfo -q path -n /dev/sdb)
• udevtest /block/sdb

Example: multiple paths on Nadir

- If luns are removed (delete_fc_luns)

- Then added (scan_fc_luns)

- No matches are found in 20-local.rules

- Add syslog output to mpio_scsi_id

+ Shows params the script is called with

+ Shows what the script returns

+ target_wwpn is not getting set

- Run udevstart (luns already attached now), matches found in 20-local.rules and device files created

Probably either a driver or udev issue.

Easiest solution is to run scan_luns and udevstart at system boot time (/etc/rc.d/rc.local)
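A hedged sketch of that workaround in /etc/rc.d/rc.local (the install path of the FC tools is an assumption):

# run after the FC driver has loaded, so the sysfs files exist
/usr/local/sbin/scan_fc_luns
/sbin/udevstart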

Page 23: Linux Mpio


Custom script: ls_fc_luns

Step              Source                          Path
Get HBA list      sysfs                           /sys/class/fc_host
Get HBA type      lspci
Get target list   sysfs (Emulex) / /proc (QLA)    /sys/class/scsi_host/hostX/targetX:Y:Z ; /proc/scsi/qla2xxx/X
Get lun list      sysfs                           /sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L
Get lun info      sysfs                           /sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L/*

Sample output (HBA WWPN, target WWPN, H:B:T:L, device, scsi_id):
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563
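A hedged sketch that produces output in the same format; the real ls_fc_luns is part of the FC tools at http://dims.ncsa.uiuc.edu/set/san. This version walks /sys/block instead of the per-host tree and assumes the fc_host / fc_transport sysfs classes referenced on these slides.

#!/bin/sh
# ls_fc_luns (sketch): print hba_wwpn target_wwpn H:B:T:L device scsi_id per path
for dev in /sys/block/sd*; do
    d=$(basename $dev)
    hbtl=$(basename $(readlink $dev/device))             # H:B:T:L, e.g. 2:0:0:0
    host=${hbtl%%:*}                                     # H
    target=${hbtl%:*}                                    # H:B:T
    [ -f /sys/class/fc_host/host$host/port_name ] || continue   # skip non-FC disks
    hba_wwpn=$(cat /sys/class/fc_host/host$host/port_name)
    tgt_wwpn=$(cat /sys/class/fc_transport/target$target/port_name 2>/dev/null)
    id=$(/sbin/scsi_id -g -u -s /block/$d)
    echo "$hba_wwpn $tgt_wwpn $hbtl $d $id"
done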

Page 24: Linux Mpio


Custom script: lip_fc_hosts

Get host list (from ls_fc_luns), then for each host:
echo "1" > /sys/class/fc_host/hostX/lip

Page 25: Linux Mpio


Custom script: scan_fc_luns

Get host list (from ls_fc_luns), then for each host:
echo "- - -" > /sys/class/scsi_host/hostX/scan
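A hedged sketch of the same idea as a loop (not the actual script):

for host in /sys/class/fc_host/host*; do
    h=$(basename $host)                          # e.g. host2
    echo "- - -" > /sys/class/scsi_host/$h/scan  # rescan this FC host for new luns
done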

Page 26: Linux Mpio


Custom script: delete_fc_luns

Get lun list (from ls_fc_luns), then for each lun:
echo "1" > /sys/class/scsi_host/hostX/targetX:Y:Z/X:Y:Z:L/delete
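A hedged sketch of the loop, driven by ls_fc_luns output (assumed to be on the PATH); it uses the per-device delete node under /sys/block, which points at the same SCSI device as the per-host path above.

ls_fc_luns | while read hba_wwpn tgt_wwpn hbtl dev id; do
    echo 1 > /sys/block/$dev/device/delete   # remove the scsi device for this path
done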

Page 27: Linux Mpio


udev - Additional Resources
• man udev
• http://www.emulex.com/white/hba/wp_linux26udev.pdf
  – Excellent white paper
• http://www.reactivated.net/udevrules.php
  – How to write udev rules
• http://www.us.kernel.org/pub/linux/utils/kernel/hotplug/udev.html
  – Information and links
• http://dims.ncsa.uiuc.edu/set/san
  – FC tools: custom tools used in demo

Page 28: Linux Mpio


Linux Multipath I/O

• Overview
• History
• Setup
• Demos
  – Active / Passive Controller Pair
  – Active / Active Controller Pair

Page 29: Linux Mpio


Linux Multipath - History

Providers
• Storage Vendor
• HBA Vendor
• Filesystem
• OS

STORAGE VENDOR

- End to end solution (they provide disk, HBA, driver, add’l software, sometimes even FC switch)

- HBA’s (and other parts) come at a markup

- One location for support tickets, but no alternate recourse if they can’t fix the problem

- Proprietary requirements (typically require 2 HBA’s, only works with their systems)

HBA VENDOR

- QLA

> Linux support spotty

+ 2.4 kernel ok, but strict requirements (2 HBA’s, exactly 2 paths per lun, active/active controllers)

+ 2.6 kernel inconsistent behavior

> Solaris support spotty (2 months to get 1 machine working, next month stops working, machine was untouched)

> Dropped Windows support prematurely (Windows MPIO layer not complete yet, only an API for vendors)

> Proprietary solution, only works with their HBA’s and configuration software

- Emulex (unix philosophy, do one thing and do it well; MPIO doesn’t belong in the driver)

FILESYSTEM

- 3rd party - Veritas, others??

- Parallel Filesystems - Ibrix, Lustre, GPFS, CXFS (enable MPIO via failover hosts)

OS

- *NEW* Solaris 10 (XPATH, but requires Solaris branded QLA cards)

- *NEW* Linux (device mapper multipath) (RedHat4, Suse, others…)

Page 30: Linux Mpio


Device Mapper Multipath
• Identify luns by scsi_id
• Create “path groups”
  – Round-robin I/O on all paths in a group
• Monitor paths for failure
  – When no paths left in current group, use next group
• Monitor failed paths for recovery
  – Upon path recovery, re-check group priorities
  – Assign new active group if necessary

Page 31: Linux Mpio


Linux Device Mapper Multipath

Overview

1. Identify unique luns
2. Monitor active paths for failure
3. Monitor failed paths for recovery

Multipath handles 3 areas.

All settings are saved in /etc/multipath.conf

Page 32: Linux Mpio


1. Identify unique luns

Storage Device
• vendor
• product
• getuid_callout

device {
    vendor "DDN"
    product "S2A 8000"
    getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
}

Page 33: Linux Mpio


1. Identify unique luns

Multipath Device
• wwid
• alias

multipath {
    wwid 360001ff020021101092fb1152a450900
    alias sdd4l0
}

Page 34: Linux Mpio


2. Monitor Healthy Paths for Failure

• Priority group
  – Collection of paths to the same physical lun
  – I/O is split across all paths in round-robin fashion
• path_grouping_policy
  – multibus
  – failover
  – group_by_prio
  – group_by_serial
  – group_by_node

Multipath control creates priority groups.

Paths are grouped based on path_grouping_policy

MULTIBUS - all paths in one priority group (DDN) (no penalty to access luns via alternate controllers)

FAILOVER - one path per priority group (Use only 1 path at a time) (typically only 1 usable path, such as IBM fastt with AVT disabled)

GROUP_BY_PRIO - Paths with same priority in same priority group, 1 group for each unique priority (Priorities assigned by external program)

GROUP_BY_SERIAL - Paths grouped by scsi target serial (controller node WWN)

GROUP_BY_NODE - (I have not tested or researched this, never had a need to)

Page 35: Linux Mpio


2. Monitor Healthy Paths for Failure

• Path Priority
  – Integer value assigned to a path
  – Higher value == higher priority
  – Directly controls priority group selection
• prio_callout
  – 3rd party pgm to assign priority values to each path

Data flow: multipath passes the device name to the prio_callout, which returns an integer priority value.

Path Grouping Policy = group_by_prio

Only matters if using “group_by_prio” grouping policy

DIRECTLY CONTROLS PRIORITY GROUP SELECTION

- Priority group with highest value is active group

PREVIOUS SLIDE - When all paths in a group are failed, next group becomes active. That would be the priority group with the next highest priority value that has an active path.

PRIO_CALLOUT

- Provided by vendor or (more typically) custom script written by admin for specific setup

- If not using group_by_prio, then set this to /bin/true

Page 36: Linux Mpio


2. Monitor Healthy Paths for Failure

• path_checker
  – tur
  – readsector0
  – directio
  – (Custom)
    • emc_clariion
    • hp_sw
• no_path_retry
  – queue
  – (N > 0)
  – fail

TUR

- SCSI Test Unit Ready

- Preferred if lun supports it (OK on DDN, IBM fastt)

- Does not cause AVT on IBM fastt

- Does not fill up /var/log/messages on failures

READSECTOR0

- physical lun access via /dev/sdX (IS THIS CORRECT???)

DIRECTIO

- physical lun access via /dev/sgY (IS THIS CORRECT???)

Both readsector0 and directio cause AVT on IBM fastt, resulting in lun thrashing

Both readsector0 and directio log “fail” messages in /var/log/messages (could be useful if you want to monitor logs for these events)

NO_PATH_RETRY

- # of retries before failing path

- queue: queue I/O forever

- (N > 0): queue I/O for N retries, then fail

- fail: fail immediately

Page 37: Linux Mpio


3. Monitor failed paths for recovery

• Failback
  – immediate (same as n=0)
  – (n > 0)
  – manual

FAILBACK

- When a path recovers, wait # seconds before enabling the path

- Recovered path is added back into multipath enabled path list

- multipath re-evaluates priority groups, changes active priority group if needed

MANUAL RECOVERY

- User runs ‘/sbin/multipath’ to update enabled paths and priority groups

Page 38: Linux Mpio


Putting it all together

multipaths {
    multipath {
        wwid 3600a0b8000122c6d00000000453174fc
        alias fastt21l0
    }
    multipath {
        wwid 3600a0b80000fd6320000000045317563
        alias fastt21l1
    }
}

devices {
    device {
        vendor "IBM"
        product "1742-900"
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        path_grouping_policy group_by_prio
        prio_callout "/usr/local/sbin/path_prio.sh %n"
        path_checker tur
        no_path_retry fail
        failback immediate
    }
}

Page 39: Linux Mpio


Putting it all together

/usr/local/etc/primary-paths
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51

Data flow: multipath passes the device name (e.g. sdb) to path_prio.sh, which finds the matching line in primary-paths and returns the priority from the last column (e.g. 50).

PATH_PRIO.SH

- grep device from primary-paths file

- return value from last column
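A hedged sketch of path_prio.sh based on the description above (the real script ships with the FC tools at http://dims.ncsa.uiuc.edu/set/san):

#!/bin/sh
# path_prio.sh (sketch): called by multipath as "path_prio.sh %n".
# Grep the device name out of the primary-paths file and print the
# priority from the last column.
PATHS=/usr/local/etc/primary-paths
grep -w "$1" $PATHS | awk '{ print $NF }'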

Page 40: Linux Mpio


Demo: Active/Passive Disk
• Host
  – One Emulex LP11000
• Disk
  – IBM DS4500
  – Luns presented through both controllers
  – Luns accessible via only 1 controller at a time
  – AVT enabled

AVT

- Lun will migrate to alternate controller if requested there

- Tolerance of cable/switch failure

- AVT penalty - lun inaccessible for 5-10 secs while controller ownership changing

SCREENS: /var/log/messages , multi-port-mon , command , script host

1. No luns (ls_fc_luns)
2. /etc/multipath.conf
   1. Multipaths (fastt)
   2. Devices (fastt)
3. /usr/local/sbin/path_prio.sh
   1. Identify controller A, controller B
4. /usr/local/etc/primary-paths
5. Add luns (scan_fc_luns)
   1. See multipath bindings & path_prio.sh output in /var/log/messages
6. View current multipath configuration
   1. multipath -v2 -l
7. Failover test
   1. Script-host: disable disk port A
   2. See multipathd reconfig in /var/log/messages
   3. See I/O path change in multi-port-mon
8. Recover test
   1. Script-host: enable disk port A

Page 41: Linux Mpio


Demo: Active/Active Disk
• Host
  – One Emulex LP11000
• Disk
  – DDN 8500
  – Luns accessible via both controllers (no penalty)

SCREENS: multi-port-mon , /var/log/messages , command , script-host

1. /etc/multipath.conf
   1. Devices (DDN) (path_prio = /bin/true ; path_grouping_policy = multibus)
   2. Multipath (DDN)
2. Luns present? (ls_fc_luns) Add luns if needed (scan_fc_luns)
   1. See multipath bindings in /var/log/messages
3. View multipath configuration
   1. multipath -v2 -l
4. Failover test
   1. Expected changes in multi-port-mon
   2. Disable switch port for disk ctlr 1
   3. See failover in /var/log/messages and multi-port-mon
5. Restore ctlr access
   1. Expected changes in multi-port-mon
   2. Enable switch port for disk ctlr 1
   3. See failback in /var/log/messages and multi-port-mon

Page 42: Linux Mpio


Path Grouping Policy Matrix

                          1 HBA                2 HBAs
Active/Active             multibus (demo1)     multibus
Active/Passive with AVT   path_prio (demo2)    path_prio
Active/Passive w/o AVT    failover (* multiple points of failure)

ACTIVE/ACTIVE 2 HBAs

- trivial, same as demo1

- Each HBA sees 1 ctlr

- Can let both HBAs see both ctlrs (4 paths to each lun)

+ Use path_prio if need to control path usage

ACTIVE/PASSIVE (AVT) 2 HBAs

- trivial, similar to demo2

ACTIVE/PASSIVE (no AVT) 1 HBA

- Tolerant of ctlr failure only.

- If anything else fails, luns will not AVT to alternate ctlr, host will lose access

ACTIVE/PASSIVE (no AVT) 2 HBAs

- Non-preferred paths will be failed

- Each HBA must have full access to both controllers

Page 43: Linux Mpio


Linux Multipath Errata
• Making changes to multipath.conf
  – Stop multipathd service
  – Clear multipath bindings
    • /sbin/multipath -F
  – Create new multipath bindings
    • /sbin/multipath -v2 -l
  – Start multipathd service
• Cannot multipath root or boot device
• user_friendly_names
  – Not really, just random names dm-1, dm-2 …

CANNOT MULTIPATH ROOT OR BOOT DEVICE

- per ap-rhcs-dm-multipath-usagetxt.html (see references section)
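A hedged sketch of the reconfiguration sequence above as shell commands (the RHEL4-style service invocation is an assumption):

root# service multipathd stop
root# /sbin/multipath -F          # flush existing multipath bindings
(edit /etc/multipath.conf)
root# /sbin/multipath -v2         # rebuild the maps, verbose
root# /sbin/multipath -v2 -l      # list the resulting topology
root# service multipathd start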

Page 44: Linux Mpio


Linux Multipath Resources
• multipath.conf.annotated
• man multipath
• http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
  – Multipath tools official home
• http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/ap-rhcs-dm-multipath-usagetxt.html
  – Description of output (multipath -v2 -l)
• http://kbase.redhat.com/faq/FAQ_85_7170.shtm
  – Setup device-mapper multipathing in Red Hat Enterprise Linux 4?
• http://dims.ncsa.uiuc.edu/set/san
  – multi-port-mon
  – Set switchport state: (en/dis)able switch port via SNMP

MULTIPATH.CONF.ANNOTATED (RedHat)

- /usr/share/doc/device-mapper-multipath-0.4.5/multipath.conf.annotated