Best Practices for top SAMG operational challenges - …vox.veritas.com/legacyfs/online/veritasdata/SM B19.pdf · 2016-07-04 · Timeouts in Veritas Cluster Server ... •Check keys

Best Practices for top SAMG operational challenges

Kalyan Subramaniyam SR. Principal Business Critical Engineer Symantec Business Critical Services

Kalyan Subramaniyam – Best Practices for top SAMG operational challenges, Symantec Vision 2012

1. When a system panics? – Global Atomic Broadcast
2. What happens when diskgroup is disabled?
3. Removing stale fencing keys
4. Timeouts in Veritas Cluster Server (VCS)
5. How to change host name in a Solaris environment
6. Difference between a ‘PR’, ‘RP’, ‘SP’ and ‘P’ release
7. Rebuilding a disk group from CBR data
8. Testing Low Latency Transport (LLT)
9. How to Start/Stop ports used by SFCFS and SFRAC manually
10. Oracle Database performance impact with VxFS

When a system panics? – Global Atomic Broadcast

• Client process failure (GAB-initiated IOFENCE):

– HAD (the GAB client) fails to heartbeat to GAB within VCS_GAB_TIMEOUT; default: 15 sec (SF 5.0), 30 sec (SF 5.1)

GAB Port h halting node due to client process failure

– GAB tries to kill HAD 5 times (gab_kill_ntries, gab_isolate_time)

– For performance issues: increase VCS_GAB_TIMEOUT to allow VCS more time to heartbeat

– For kernel problems: configure GAB (gabconfig -k) not to panic but to keep attempting to kill HAD

– If the problem persists, collect performance statistics and crash dumps and contact Symantec support.

• Registration monitoring (SF 5.1)

– hashadow cannot restart HAD, or restarts it but HAD cannot register

– hashadow and HAD were killed by user intervention and are not running

– You can restart HAD or unconfigure GAB within VCS_GAB_RMTIMEOUT (200 sec)

– GAB takes action if HAD does not register within the time defined in VCS_GAB_RMTIMEOUT

– Control the GAB behavior by setting VCS_GAB_RMACTION to PANIC or SYSLOG (default)
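The three tunables above are environment variables read by VCS at startup. A minimal sketch of how the settings could be written out is shown below. It assumes, from memory of the VCS administrator documentation rather than from this deck, that the values are expressed in milliseconds and that they are placed in /opt/VRTSvcs/bin/vcsenv; verify both for your release. The helper name gab_env_lines is hypothetical.

```shell
#!/bin/sh
# Print vcsenv-style lines for the GAB/HAD timeouts discussed above.
# Assumptions (not stated on the slide): values are in milliseconds and
# are read from /opt/VRTSvcs/bin/vcsenv. Check your release before use.
gab_env_lines() {
    timeout_sec=$1      # HAD-to-GAB heartbeat timeout, in seconds
    rmtimeout_sec=$2    # registration monitoring timeout, in seconds
    rmaction=$3         # PANIC or SYSLOG, as named on the slide

    case "$rmaction" in
        PANIC|SYSLOG) ;;
        *) echo "rmaction must be PANIC or SYSLOG" >&2; return 1 ;;
    esac

    echo "VCS_GAB_TIMEOUT=$((timeout_sec * 1000)); export VCS_GAB_TIMEOUT"
    echo "VCS_GAB_RMTIMEOUT=$((rmtimeout_sec * 1000)); export VCS_GAB_RMTIMEOUT"
    echo "VCS_GAB_RMACTION=$rmaction; export VCS_GAB_RMACTION"
}

# SF 5.1 defaults from the slide: 30 sec, 200 sec, SYSLOG
gab_env_lines 30 200 SYSLOG
```

The function only prints the lines; review them before adding anything to a live node.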


What happens when diskgroup is disabled?

• When is a disk group disabled?

– Kernel log, config copies or headers in the private region are invalid

• When can the disabled disk group be deported?

– All its I/Os have completed and all its volumes are closed

• What happens when a disk cable is disconnected?

– I/Os to volumes return errors. Disk private region cannot be updated

– In VCS, if the forced unmount succeeds, the disk group can be deported

• PanicSystemOnDGLoss (DiskGroup agent attribute)

– Enabled: the node panics if the disk group is marked disabled (dgdisabled)

– Enabled: the node panics when the monitor times out because the monitor is hung

– Disabled: if I/O fencing is enabled, the DiskGroup resource is marked FAULTED

– Default: Disabled (5.1 / 6.0); Enabled (5.0). No action is taken if the service group is frozen



Removing stale fencing keys

• Utility: /opt/VRTSvcs/vxfen/bin/vxfenclearpre

– Use when a split-brain condition is encountered

– Does not support server-based fencing (coordination point server)

– To remove SCSI-3 registrations and reservations on the disks:

    • Stop VCS and I/O fencing on all nodes

    • Shut down all applications that run outside VCS control and have access to shared storage

    • Start the script: # vxfenclearpre

• To manually remove stale fencing keys:

    • Stop VCS and I/O fencing on all nodes, and shut down all applications that access shared storage

    • Check keys on all paths: # vxdisk -qeo alldgs list | awk '{print "/dev/rdsk/"$6}' > /tmp/disks

    • Check registrations: # /sbin/vxfenadm -g all -f /tmp/disks

    • Check reservations: # /sbin/vxfenadm -r all -f /tmp/disks

    • Clear keys: # /sbin/vxfenadm -a -k tmp -f /tmp/disks

      # /sbin/vxfenadm -x -k tmp -f /tmp/disks

• Monitor fencing keys: http://www.symantec.com/docs/TECH78563
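The disk-list step in the manual procedure depends on the awk quoting being exact, so a small sketch of that one step may help. It reads listing text on standard input and prints one raw device path per line, using the same column (field 6) as the slide. The function name disk_paths, the NF guard, and the sample device names are illustrative assumptions, not taken from the slide.

```shell
#!/bin/sh
# Turn `vxdisk -qeo alldgs list` output (read from stdin) into raw device
# paths, one per line, using the same awk field as the slide (column 6).
# The NF guard is an addition: it skips lines with fewer than six fields.
# Note: a status containing a space (e.g. "online invalid") shifts the
# columns, so review the resulting list before passing it to vxfenadm.
disk_paths() {
    awk 'NF >= 6 { print "/dev/rdsk/" $6 }'
}

# Hypothetical sample listing, for illustration only.
printf '%s\n' \
    'emc0_01 auto:cdsdisk - (vxfencoorddg) online c2t1d0s2 -' \
    'emc0_02 auto:cdsdisk - (vxfencoorddg) online c2t2d0s2 -' |
    disk_paths
```

On a cluster node the function would be fed from the real command, for example `vxdisk -qeo alldgs list | disk_paths > /tmp/disks`.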



Timeouts in VERITAS Cluster Server

• When a system resource is used to its limit, multiple VCS resource timeouts occur within a short period.

• Timeouts can cause the clean entry point to be called, which is essentially a forced offline by VCS.

• Reduce the CPU load on the affected cluster nodes; this prevents the timeouts in the first place.

• Recommended tunables:

– FaultOnMonitorTimeouts (default 4)

• Defines the number of consecutive monitor timeouts that can occur before clean is called.

– MonitorTimeout (default 60 sec)

• Defines how long the monitor will wait before it declares the resource to be timed out.

– ToleranceLimit (default 0)

• Defines the number of times the Monitor routine should return an offline status before declaring a resource offline.

• Typically used when resource is busy and appears to be offline.

• Memory leak in agents fixed in VCS 5.0MP3RP5 and VCS 5.1SP1RP2
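To make the interaction of the first two tunables concrete: with the defaults, clean is called only after 4 consecutive monitor calls have each run for the full 60 seconds. The sketch below computes that window and prints (without running) the commands that would change a type-level attribute. The helper names are hypothetical, the Mount type and the value 120 are arbitrary examples, and the window is a lower bound because the time between monitor cycles is not counted.

```shell
#!/bin/sh
# Seconds spent inside timed-out monitor calls before clean is invoked:
# FaultOnMonitorTimeouts * MonitorTimeout. Lower bound only, since the
# gaps between monitor cycles are not included.
timeout_window() {
    echo $(( $1 * $2 ))
}

# Print (do not run) the commands to change a type-level attribute.
# Arguments: resource type, attribute name, new value.
tune_cmds() {
    echo "haconf -makerw"
    echo "hatype -modify $1 $2 $3"
    echo "haconf -dump -makero"
}

timeout_window 4 60                  # defaults from the slide
tune_cmds Mount MonitorTimeout 120   # example type and value only
```

Raising a timeout hides the symptom; the slide's advice to reduce load on the node still applies.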



How to change the host name in a Solaris environment

– Stop the cluster using hastop -all

– Modify the following files on the node that is being changed:

/etc/hosts
/etc/VRTSvcs/conf/config/main.cf
/etc/llthosts
/etc/hostname.<interface>
/etc/llttab
/etc/nodename
/etc/VRTSvcs/conf/sysname (may not exist, depending on the configuration)

– On the rest of the nodes in the cluster, replace the old host name in /etc/llthosts

– As appropriate, update operating system files (/.rhosts, /etc/hosts, /etc/hosts.equiv)

– Verify cluster configuration on the changed host:

# hacf -verify /etc/VRTSvcs/conf/config

– Copy /etc/VRTSvcs/conf/config/main.cf to other cluster nodes

– Update VxVM with new host name:

# vxdctl hostid <new host name>

– Reboot the system
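Because the old name is spread over several files, a quick check before the reboot can catch a missed one. The sketch below lists which of the given files still contain a name as a whole word. The function name files_with_name is hypothetical, and the name is treated as a regular expression, so a dot in it matches any character.

```shell
#!/bin/sh
# List the files (from the arguments after the first) that still contain
# the given host name as a whole word. Missing files are skipped, since
# some of them (for example sysname) may not exist on every node.
files_with_name() {
    name=$1
    shift
    for f in "$@"; do
        [ -f "$f" ] || continue
        if grep -w "$name" "$f" >/dev/null 2>&1; then
            echo "$f"
        fi
    done
    return 0
}

# Example (oldhost is a placeholder):
# files_with_name oldhost /etc/hosts /etc/llthosts /etc/llttab \
#     /etc/nodename /etc/VRTSvcs/conf/sysname \
#     /etc/VRTSvcs/conf/config/main.cf
```

An empty result after the edits suggests none of the listed files still references the old name; interface files such as /etc/hostname.<interface> need to be added to the list by hand.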



Difference between a “PR”, “RP”, “SP” and “P” release

Vehicle | Solaris Package version | AIX Fileset version | Linux RPM version | HP Depot version | Media and Doc Names
Major | 6.0.000.000 | 6.0.0.0 | 6.0.000.000 | 6.0.000.000 | 6.0
Minor | 5.1.000.000 | 5.1.0.0 | 5.1.000.000 | 5.1.000.000 | 5.1
Rolling Patch (RP) | 5.1.001.000 (PSTAMP = 5.1.001.000-5.1RP1-yyyy-mm-dd) | 5.1.1.0 | 5.1.001.000 | 5.1.001.000 | 5.1 RP1
Maintenance Pack (MP) – 5.0x only | 5.0.400.000 (PSTAMP = 5.0.400.000-5.0MP4-yyyy-mm-dd) | 5.0.400.0 | 5.0.400.000 | 5.0.400.000 | 5.0MP4
Service Pack (SP) | 5.1.100.000 (PSTAMP = 5.1.100.000-5.1SP1-yyyy-mm-dd) | 5.1.100.0 | 5.1.100.000 | 5.1.100.000 | 5.1 SP1
Platform Release (PR) | 5.1.010.000 (new package) | 5.1.10.0 | 5.1.010.000 | 5.1.010.000 | 5.1 PR1
P Patch | 5.1.000.100 (PSTAMP = 5.1.000.100-5.1P1-yyyy-mm-dd) | 5.1.0.100 | 5.1.000.100 | 5.1.000.100 | 5.1 P1
Hot Fix | 5.1.000.120 (PSTAMP = 5.1.000.120-5.1P1HF20-yyyy-mm-dd) | 5.1.0.120 | 5.1.000.120 | 5.1.000.120 | 5.1 P1 HF20
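The table rows suggest a positional encoding in the Solaris/Linux/HP version strings, and a small decoder makes that pattern easier to see. The sketch below is inferred only from the eight example rows: in the third field the digits appear to carry SP (MP for 5.0), PR and RP, and in the fourth field the first digit carries the P patch and the last two the hot fix. That mapping is an assumption, the function name decode_release is hypothetical, and the AIX form (for example 5.1.1.0) is not handled.

```shell
#!/bin/sh
# Decode a version such as 5.1.000.120 into a release label.
# Digit mapping is inferred from the table rows and may not hold for
# every release; treat the result as a hint, not as authoritative.
decode_release() {
    case $1 in
        [0-9]*.[0-9]*.[0-9][0-9][0-9].[0-9][0-9][0-9]) ;;
        *) echo "expected N.N.NNN.NNN, got: $1" >&2; return 1 ;;
    esac

    base=${1%.*.*}          # e.g. 5.1
    rest=${1#"$base".}      # e.g. 000.120
    third=${rest%.*}
    fourth=${rest#*.}

    sp=${third%??}                   # first digit: SP (MP on 5.0)
    pr=${third#?}; pr=${pr%?}        # second digit: PR
    rp=${third#??}                   # third digit: RP
    p=${fourth%??}                   # first digit: P patch
    hf=${fourth#?}; hf=${hf#0}       # last two digits: hot fix

    pack=SP
    if [ "$base" = "5.0" ]; then pack=MP; fi

    out=$base
    if [ "$sp" != 0 ]; then out="$out $pack$sp"; fi
    if [ "$pr" != 0 ]; then out="$out PR$pr"; fi
    if [ "$rp" != 0 ]; then out="$out RP$rp"; fi
    if [ "$p" != 0 ]; then out="$out P$p"; fi
    if [ "$hf" != 0 ]; then out="$out HF$hf"; fi
    echo "$out"
}

decode_release 5.1.000.120
```

The output uses a space between tokens throughout, so the 5.0 row reads "5.0 MP4" rather than the table's "5.0MP4".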


• First Rolling Patch (RP): Initial Product Release +3 months

• First Service Pack (SP): Second Rolling Patch +3 months

• All patches are available at https://sort.symantec.com/patch/matrix


Rebuilding a disk group from CBR data

# vxdg -Cf import testdg

VxVM vxdg ERROR V-5-1-587 Disk group testdg: import failed: Disk group has no valid configuration copies

• The vxconfigrestore utility restores a disk group's configuration information if it has been lost or corrupted

• Default location of the backup file for configuration records: /etc/vx/cbr/bk/dgname.dgid/dgid.cfgrec

# vxdisk -o alldgs list
DEVICE       TYPE          DISK         GROUP     STATUS
ams_wms0_72  auto:cdsdisk  -            (testdg)  online
ams_wms0_73  auto:none     -            -         online invalid

# /usr/lib/vxvm/bin/vxconfigrestore -p testdg
Diskgroup testdg configuration restoration started ......
Installing volume manager disk header for ams_wms0_73 ...
ams_wms0_73 disk format has been changed from none to cdsdisk.
testdg's diskgroup configuration is restored (in precommit state).
Diskgroup can be accessed in read only and can be examined using vxprint in this state.
Run:
vxconfigrestore -c testdg ==> to commit the restoration.
vxconfigrestore -d testdg ==> to abort the restoration.

# /usr/lib/vxvm/bin/vxconfigrestore -c testdg
Committing configuration restoration for diskgroup testdg....
testdg's diskgroup configuration restoration is committed.

# vxdisk -o alldgs list
DEVICE       TYPE          DISK         GROUP     STATUS
ams_wms0_72  auto:cdsdisk  ams_wms0_72  testdg    online
ams_wms0_73  auto:cdsdisk  ams_wms0_73  testdg    online


Testing Low Latency Transport: dlpiping, lltping

• dlpiping

– Utility exchanges network traffic over a specific DLPI network device to test network connectivity.

Node 0:
# /opt/VRTSllt/getmac /dev/qfe:0
/dev/qfe:0 08:00:20:E7:DE:B2
# /opt/VRTSllt/dlpiping -s /dev/qfe:0

Node 1:
# /opt/VRTSllt/dlpiping -c /dev/qfe:0 08:00:20:E7:DE:B2
08:00:20:E7:DE:B2 is alive
# /opt/VRTSllt/dlpiping -c /dev/hme:1 08:00:20:E7:DE:B2
no response from 08:00:20:E7:DE:B2

• lltping (run the test on an LLT port that is not already in use; # /sbin/lltstat -p shows the ports in use)

Options:
-c: client side, node ID from /etc/llthosts
-T: round trip time
-p: LLT port to be used
-v: verbose operation

Node 0:
# /opt/VRTSllt/lltping -s -T -p20 -v &
6014
lltping: opening LLT dev: /dev/llt port: 20
lltping: mynodeid: 0

Node 1:
# /opt/VRTSllt/lltping -c 0 -T -p20 -v &
7375
lltping: opening LLT dev: /dev/llt port: 20
lltping: mynodeid: 1
lltping: send_recv_ping to node=0, pkts=20
lltping: send pkt=0
lltping: sending a msg to node 0
lltping: pkt=0, rtt=(0s 176us), tx=(0s 93us)
...
lltping: send pkt=19
lltping: sending a msg to node 0
lltping: pkt=19, rtt=(0s 148us), tx=(0s 125us) rx=(0s 23us)
lltping: pkts=20, msgsz=128, RTT:min/avg/max=88/112/195 usec
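When lltping is run repeatedly, the summary line is the part worth tracking. The sketch below pulls the average round trip time out of that line; it assumes the summary format is exactly as shown in the output above, and the function name llt_avg_rtt is hypothetical.

```shell
#!/bin/sh
# Extract the average round trip time (microseconds) from lltping output
# read on stdin. Only the "RTT:min/avg/max=a/b/c usec" summary line
# matches; every other line is ignored.
llt_avg_rtt() {
    sed -n 's|.*RTT:min/avg/max=[0-9]*/\([0-9]*\)/[0-9]* usec.*|\1|p'
}

# Summary line copied from the sample run above.
echo 'lltping: pkts=20, msgsz=128, RTT:min/avg/max=88/112/195 usec' |
    llt_avg_rtt
```

For the sample line this prints 112, the middle value of the min/avg/max triple.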



How to Start/Stop ports used by SFCFS/SFRAC manually

Port | Name | Description | Start/Stop Script | Start Command | Stop Command
- | llt | Low Latency Transport | S70llt | lltconfig -c | lltconfig -U
- | lmx | LLT Multiplexer | S71lmx | lmxconfig -c | lmxconfig -U
a | gab | Global Atomic Broadcast | S92gab | gabconfig -c | gabconfig -U
d | odm | Oracle Disk Manager | S92odm | mount /dev/odm | umount /dev/odm
b | fencing | I/O Fencing | S97vxfen | vxfenconfig -c | vxfenconfig -U
o | vcsmm | Membership Module | S98vcsmm | vcsmmconfig -c | vcsmmconfig -U
u,v,w | cvm | Cluster Volume Manager | - | vxclustadm -m vcs -t gab startnode | vxclustadm stopnode
q | quicklog | discontinued at 5.0 | - | /opt/VRTSvxfs/sbin/qlogckd | pkill qlogckd; qlogclustadm deinit
f | cfs | Cluster File System | - | /opt/VRTSvxfs/sbin/vxfsckd | pkill vxfsckd; fsclustadm cfsdeinit
h | had | The main VCS engine | S99vcs | hastart | hastop -local [-force]


• Solaris: In 5.1, start/stop scripts have moved under Service Management Facility (SMF)

• lmx and vcsmm are only applicable to SFRAC installations

• cfs, cvm and quicklog (if used) are configured to be resources under VCS
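The script numbers in the table (S70 through S99) imply a start order, and a manual stop would normally walk the same list backwards. The sketch below prints both sequences as a dry run for the components that have an rc script. The ordering is an assumption taken purely from those script numbers, the function names are hypothetical, and lmx and vcsmm apply only to SFRAC; the product documentation for your release remains the authority.

```shell
#!/bin/sh
# Rows are "name|start command|stop command", in rc-script (start) order
# as listed in the table. cvm, cfs and quicklog are omitted because they
# run as resources under VCS.
PORT_TABLE='llt|lltconfig -c|lltconfig -U
lmx|lmxconfig -c|lmxconfig -U
gab|gabconfig -c|gabconfig -U
odm|mount /dev/odm|umount /dev/odm
vxfen|vxfenconfig -c|vxfenconfig -U
vcsmm|vcsmmconfig -c|vcsmmconfig -U
had|hastart|hastop -local'

# Print start commands in table order (dry run, nothing is executed).
start_order() {
    printf '%s\n' "$PORT_TABLE" | awk -F'|' '{ print $2 }'
}

# Print stop commands in reverse table order (dry run).
stop_order() {
    printf '%s\n' "$PORT_TABLE" |
        awk -F'|' '{ cmd[NR] = $3 } END { for (i = NR; i >= 1; i--) print cmd[i] }'
}

stop_order
```

Printing rather than executing keeps the sketch safe to run anywhere; each command should be checked against the node's actual configuration before use.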


Oracle Database performance impact with VxFS

• Occurs under sequential workload when AUTOEXTEND is enabled in a non-ODM environment

• Severe VxFS extent fragmentation was identified to be the root cause

• Oracle file extent fragmentation can be observed using fsadm -Ef <filename>

– An Oracle file can be considered highly fragmented if the “Average # Extents” field shows several thousand extents

– /opt/VRTS/bin/fsmap -aH <filename> shows the size of each extent

• When the system calls issued by the Oracle process were traced:

– The trace shows many lseek operations, which can affect performance

– Oracle file creation and extension operations also cause fragmentation

– The Oracle process uses 4 LWPs, each writing 1 MB in parallel to a different offset within the file, which is not optimal for the VxFS extent allocator.

• Workaround:

– Use the setext command to preallocate files; additionally, use setext -e <size> <filename> to change the geometry of the file to a fixed extent size; http://www.symantec.com/docs/TECH167404

– Oracle Bug 11892765: please allow RMAN restore to pre-allocate file extent


Copyright © 2012 Symantec Corporation. All rights reserved. Symantec and the Symantec Logo are trademarks or registered trademarks of Symantec Corporation or its affiliates in the U.S. and other countries. Other names may be trademarks of their respective owners. This document is provided for informational purposes only and is not intended as advertising. All warranties relating to the information in this document, either express or implied, are disclaimed to the maximum extent allowed by law. The information in this document is subject to change without notice.

Thank you! Kalyan Subramaniyam SR. Principal Business Critical Engineer

Symantec Business Critical Services
