Transcript
  • CT 320: Network and System Administration, Fall 2014*

    Dr. Indrajit Ray Email: [email protected]

    Department of Computer Science

    Colorado State University Fort Collins, CO 80528, USA

    Dr. Indrajit Ray, Computer Science Department, CT 320 Network and Systems Administration, Fall 2014

    * Thanks to Dr. James Walden, NKU and Russ Wakefield, CSU for the contents of these slides

  • Disks


  • Topics


    1. Disk components
    2. Disk interfaces
    3. Lifecycle of a disk
    4. Performance
    5. Reliability
    6. RAID
    7. Adding a disk
    8. Logical volumes
    9. Filesystems

  • Hard Drive Components


  • Physical Disk Geometry

    One head for each surface

    All tracks at the same radius (r = d_n) form a cylinder

    Each sector holds 512+ bytes of information

    One surface is dedicated to positioning and synchronization

    Not all portions of the disk are addressable by the OS


  • Hard Drive Components

    Actuator: Moves the arm across the disk to read/write data. The arm has multiple read/write heads (often 2 per platter).

    Platters: Rigid substrate material. A thin coating of magnetic material stores the data. The coating type determines areal density (Gbits/in²).

    Spindle Motor: Spins the platters at 3600-15,000 rpm. Speed determines disk latency.

    Cache: 2-16 MB of cache memory, often more. Reliability concern: write-back vs. write-through.


  • Disk Information: hdparm


    # hdparm -i /dev/hde

    /dev/hde:
     Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667
     Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
     RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
     BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
     CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
     IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
     PIO modes: pio0 pio1 pio2 pio3 pio4
     DMA modes: mdma0 mdma1 mdma2
     UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
     AdvancedPM=no WriteCache=enabled
     Drive conforms to: device does not report version

    * signifies the current active mode
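    The LBAsects field above implies the drive's usable capacity: LBA sector count × 512 bytes. A quick sanity check of the figures from that output (plain awk arithmetic, no device access needed):

```shell
# Capacity implied by the hdparm output above: LBAsects x 512-byte sectors.
awk 'BEGIN { sects = 234441648; printf "%.1f GB\n", sects * 512 / 1e9 }'
# prints: 120.0 GB (consistent with the WD1200 model number, a 120 GB drive)
```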

  • Disk Performance

    Seek Time: Time to move the head to the desired track (3-8 ms)

    Rotational Delay: Time until the head is over the desired block (~8 ms for 7200 rpm)

    Latency: Seek Time + Rotational Delay

    Throughput: Data transfer rate (20-80 MB/s)
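    These figures combine into an average access latency. A minimal back-of-the-envelope sketch (the 5 ms seek and 7200 rpm are assumed example values, not a measurement):

```shell
# Average latency = seek time + average rotational delay (half a revolution).
# One full revolution at 7200 rpm takes 60000/7200 = 8.33 ms, which matches
# the ~8 ms worst case quoted above; the average wait is half of that.
awk 'BEGIN { seek = 5; rpm = 7200; rot = 60000 / rpm / 2; printf "%.2f ms\n", seek + rot }'
# prints: 9.17 ms
```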


  • Latency vs. Throughput

    Which is more important? Depends on the type of load.

    Sequential access favors throughput: e.g., multimedia on a single-user PC

    Random access favors latency: e.g., most servers

    How to improve performance: Faster disks. Caching. More spindles (disks). More disk controllers.


  • Disk Performance: hdparm


    # hdparm -tT /dev/hde

    /dev/hde:
     Timing cached reads:        876 MB in 2.00 seconds = 437.41 MB/sec
     Timing buffered disk reads:  88 MB in 3.08 seconds =  28.60 MB/sec
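    The reported rates are just the quotient of data moved and elapsed time; e.g., for the buffered disk read above:

```shell
# Buffered disk read rate from the hdparm run above: 88 MB in 3.08 s.
awk 'BEGIN { printf "%.1f MB/s\n", 88 / 3.08 }'
# prints: 28.6 MB/s
```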

  • Reliability

    MTBF: Average time between failures (often quoted as >1,000,000 hours).

    Real failure curves: Early phase: high failure rate from defects. Constant-failure-rate phase: MTBF valid. Wearout phase: high failure rate from wear.

    Failures are more likely on traumatic events, such as power on/off.

    Systems often wear out before MTBF. The average life span of a disk is about 5 years.


  • Solid State Drives

    Flash-memory-based solid state drives: No moving parts. Much higher I/O performance than hard disks; random reads in particular are very fast. Less prone to failure (more reliable).

    Higher cost. Uses NAND memory.


  • NAND Flash Constraints (1)

    A flash module is divided into blocks, pages, and sectors. E.g., 1 GB = 8K blocks of 64 pages of 4 sectors of 512 bytes.
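    The example geometry multiplies out to exactly 1 GiB, which is an easy way to check such figures:

```shell
# 8K blocks x 64 pages/block x 4 sectors/page x 512 bytes/sector = 1 GiB.
awk 'BEGIN { printf "%d bytes\n", 8192 * 64 * 4 * 512 }'
# prints: 1073741824 bytes
```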

    Read/write happens at page granularity (as with disks). Writes are more time- and energy-consuming than reads (by a factor of 3 to 10).

    Pages must be written sequentially within a block. Erase happens at block granularity.

    Erase-before-rewrite constraint: an erase is ~10 times more costly than a page write.

    A block wears out after about 10^6 write/erase cycles.


  • NAND Flash Constraints (2)

    These hardware constraints usually force updates to be made out of place.

    A Flash Translation Layer (FTL) is required for: address translation, wear leveling, garbage collection.

    The FTL is a main source of unpredictability: it is very badly adapted to random writes and provides no guarantee against read/write failures.


  • Disk Interfaces


    SCSI: Standard interface for servers.

    IDE: Standard interface for PCs.

    Fibre Channel: High bandwidth. Can run SCSI or IP.

    USB: Fast enough for slow devices on PCs.

  • SCSI

    Small Computer Systems Interface. Fast, reliable, expensive.

    A bus, not a simple PC to device interface. Each device has a target # ranging 0-7 or 0-15. Devices can communicate directly w/o CPU.

    Many versions. Original: SCSI-1 (1979), 5 MB/s. Current: SCSI-3 (2001), 320 MB/s.

    Serial Attached SCSI (SAS): Up to 128 devices. Up to 2 GB/s full duplex.


  • IDE

    Integrated Drive Electronics / AT Attachment. Slower, less reliable, cheap. Only allows 2 devices per interface. The ATAPI standard added removable devices.

    Many versions. Original: IDE / ATA (1984). Current: Ultra-ATA/133, 133 MB/s.

    Serial ATA: Up to 128 devices. 1.5 Gbit/s; newer standard up to 6 Gbit/s.


  • IDE vs. SCSI

    SCSI offers better performance and scalability: Faster bus. Faster hard drives (up to 15,000 rpm). Lower CPU usage. Better handling of multiple requests.

    Cheaper IDE is often best for workstations. Convergence:

    SATA2 and SAS converging on a single standard.


  • Other Host Interfaces

    PCI Express Speeds up to 2.0 GB/s

    Fibre Channel: Very high speed achievable. Can support a variety of network communication protocols such as SCSI / IP.

    Almost exclusively used for servers. USB, Firewire:

    Generally much slower and hence not used for internal disks

    USB 3.0 promises speeds of up to 5 Gbit/s


  • RAID

    Redundant Array of Independent Disks. Can be implemented in hardware or software. Hardware RAID controllers add:

    Caching. Automated rebuilding of arrays.

    Advantages: Capacity, reliability, fault-tolerance, throughput.


  • RAID Levels


    RAID 0: Striped evenly for performance. MTBF = (avg disk MTBF) / (# disks).
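    The MTBF formula is easy to apply; e.g., with four disks each rated at 1,000,000 hours (hypothetical figures):

```shell
# RAID 0: any single disk failure loses the whole array, so
# array MTBF = per-disk MTBF / number of disks.
awk 'BEGIN { mtbf = 1000000; n = 4; printf "%d hours\n", mtbf / n }'
# prints: 250000 hours
```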

  • RAID Levels (contd)


    RAID 1: Mirrored for reliability. Every write goes to each disk of the set.

    Seek time is effectively halved, as reads are split between the disks.

    RAID 0 + 1: Striped + mirrored

  • RAID Levels


    RAID 5: Striped with distributed parity. Block striping, not disk striping. Can lose one disk of the set without losing data.
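    Distributed parity costs one disk's worth of capacity. A quick example with five hypothetical 2 TB disks:

```shell
# RAID 5 usable capacity = (n - 1) x disk size; one disk's worth holds parity.
awk 'BEGIN { n = 5; size = 2; printf "usable %d TB of %d TB raw\n", (n - 1) * size, n * size }'
# prints: usable 8 TB of 10 TB raw
```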

  • RAID Levels

    JBOD: Concatenated for capacity. Only the data on the bad disk is lost; no performance penalty.

    RAID 3 and 4 exist but are not popular. RAID 3 uses byte-level striping with a dedicated parity disk.

    RAID 4 uses block-level striping with a dedicated parity disk.

    RAID 6 extends RAID 5 by using two parity blocks


  • Lifecycle of a HDD

    1. Blank media
    2. Low-level format (performed at the factory)
    3. Partition
    4. High-level format
    5. Operating system install
    6. Systems operation


  • Blank Magnetic Media

    For simplicity we will use a linear model of the magnetic media

    Unless we are performing electron microscopy, the exact media geometry is not significant

    The blank media has only geometric structure and raw magnetic storage

    [Diagram: blank media drawn as a linear strip running from Beginning to End]


  • Read / Write Process (simplified)

    Write process: Digital signals are encoded (for timing recovery) and transformed into analog signals that drive the magnetic field on the write head.

    Read process: The analog magnetic field is sensed, timing is recovered, and the sampled signal is converted into digital data.

    [Diagram: read/write head positioned over the linear strip of media, Beginning to End]


  • Low Level Format

    Low-level formatting adds indivisible units of storage called sectors. Most modern HDDs use 512+ byte sectors.

    The "+" accounts for sector overhead bytes (they differ by manufacturer). Overhead bytes provide error correction and timing recovery functions.

    Bad sectors are automatically remapped to redundant sectors by the HDD controller.

    [Diagram: a run of sectors, each 512 bytes plus overhead; redundant sectors visible only to the HDD controller; one individual sector shown with its sector overhead]


  • Partitioning

    The Master Boot Record (MBR) is created; it includes the Master Boot Code (MBC) and the Master Partition Table (MPT), and sits at sector 1 on any bootable media

    The MBC is executed at boot if the HDD is designated as the boot device

    The MPT contains information about logical volumes, including the active partition: the partition whose Volume Boot Code (VBC) will be executed

    Each partition has a Disk Parameter Block (DPB) that stores information about the partition: file system type, date and time last mounted, etc.

    Inter-partition gaps are a collection of unused sectors. Some sectors are unused due to addressing issues.

    [Diagram: Master Boot Record (MBC + MPT), inter-partition gap, Partition #1 with its Volume Boot Record (VBC + DPB), unused sectors, Partition #2 with its VBC + DPB]
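    One concrete detail of the MBR layout can be demonstrated without a real disk: the last two bytes of the 512-byte boot sector must hold the 0x55 0xAA boot signature. A sketch on a throwaway image file (mbr.img is a hypothetical name, not from the slides):

```shell
# Build a blank 512-byte "sector" and stamp the boot signature at offset 510,
# exactly where an MBR-partitioned disk carries it (octal \125\252 = 0x55 0xAA).
dd if=/dev/zero of=mbr.img bs=512 count=1 2>/dev/null
printf '\125\252' | dd of=mbr.img bs=1 seek=510 conv=notrunc 2>/dev/null
od -An -tx1 -j 510 -N 2 mbr.img     # dumps the two signature bytes: 55 aa
rm -f mbr.img
```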


  • High Level Format (File System)

    The MPT now contains the file system type and cluster size. Cluster sizes are in increments of 512 bytes (one sector). The cluster becomes the indivisible allocation unit for the operating system.

    A file system structure is created: FAT creates a file allocation table (a simple table). NTFS creates a master file table (a database). Linux EXT2/EXT3/EXT4 creates a virtual file system.

    [Diagram: MBR (MBC + MPT), file system structures, and free space divided into cluster blocks]
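    Because the cluster is the indivisible allocation unit, small files get rounded up to a whole cluster. A quick illustration with an assumed 4 KB cluster size:

```shell
# A 100-byte file still occupies one full 4096-byte cluster on disk.
awk 'BEGIN { cluster = 4096; file = 100; printf "%d bytes on disk\n", int((file + cluster - 1) / cluster) * cluster }'
# prints: 4096 bytes on disk
```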


  • Operating System Install

    Operating system code, application code, configuration data and application data are installed

    A swap file is created for NTFS and UNIX variants (Linux, Unix, FreeBSD, etc.)

    Boot code is written to the MBC (or the VBC if a boot loader is used)

    [Diagram: MBR (MBC + MPT), file system structures, operating system code / data, swap space, free space]


  • Adding a Disk

    Install the new hardware. Verify the disk is recognized by the BIOS.

    Boot. Verify the device exists in /dev.

    Partition: fdisk /dev/sdb

    Create a filesystem: mkfs -v -t ext3 /dev/sdb1

    Add an entry to /etc/fstab: /dev/sdb1 /proj ext3 defaults 0 2

    Mount it: mount -a


  • When don't you need a filesystem?

    Swap space: mkswap -v /dev/sdb1

    Server applications: Oracle, VMware Server


  • Logical Volumes

    What are logical volumes? They appear to the user as a physical volume, but can span multiple partitions and/or disks.

    Why logical volumes? Aggregate disks for performance/reliability. Grow and shrink logical volumes on the fly. Move logical volumes between physical devices. Replace volumes without interrupting service.


  • LVM


  • LVM Components

    Logical Volume Group (LVG): A set of physical volumes (partitions or disks). May be divided into logical volumes (LVs).

    LVs are made up of fixed-size logical extents (LEs). Each LE is 4 MB. Physical extents are the same size.
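    With 4 MB extents, sizes map directly to extent counts; e.g., the 100 GB volume created later in these slides:

```shell
# A 100 GB logical volume built from 4 MB logical extents.
awk 'BEGIN { lv_mb = 100 * 1024; le_mb = 4; printf "%d extents\n", lv_mb / le_mb }'
# prints: 25600 extents
```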


  • Mapping Modes


    Linear Mapping: LVs are assigned to contiguous areas of PV space.

    Striped Mapping: LEs are interleaved across PVs to improve performance.

  • Setting up a LVG and LV


    1. Initialize physical volumes: pvcreate /dev/hda1; pvcreate /dev/hdb1

    2. Initialize a volume group: vgcreate nku_proj /dev/hda1 /dev/hdb1 (use vgextend to add more PVs later)

    3. Create logical volumes: lvcreate -n nku1 --size 100G nku_proj

    4. Create a filesystem: mkfs -v -t ext3 /dev/nku_proj/nku1

  • Extending a LV

    Set an absolute size: lvextend -L 120G /dev/nku_proj/nku1

    Or set a relative size: lvextend -L +20G /dev/nku_proj/nku1

    Expand the filesystem without unmounting: ext2online -v /dev/nku_proj/nku1

    Check the size: df -k


  • Swap


    Can use a swap file instead of a swap partition: dd if=/dev/zero of=/swapfile bs=1024k count=512

    mkswap /swapfile

    Enable swap: swapon /swapfile; swapon /dev/sda2

    Disable swap: swapoff /swapfile; swapoff /dev/sda2

    Check swap resource usage: cat /proc/swaps

  • Filesystems

    ext4: Gaining popularity. Can support volumes with sizes up to 1 exbibyte (2^60 bytes) and files up to 16 tebibytes (a tebibyte is 2^40 bytes).

    ext3: Currently the most common Linux filesystem. Journaling eliminates the need for fsck.

    ext2: Older Linux non-fragmenting fast filesystem. Can be converted to ext3 by adding a journal: tune2fs -j /dev/sda1


  • Mounting

    To use a filesystem: mount /dev/sda1 /mnt; df /mnt

    Automatic mounting: add an entry in /etc/fstab

    Unmount: umount /dev/sda1. You cannot unmount a volume that is in use.


  • fstab


    # /etc/fstab: static file system information.
    #
    proc            /proc            proc     defaults  0  0
    /dev/hdc1       /                ext3     defaults  0  1
    /dev/hdc5       /win             vfat     user,rw   0  0
    /dev/hdc7       none             swap     sw        0  0
    /dev/hdc8       /var             ext3     defaults  0  2
    /dev/hdc9       /home            ext3     defaults  0  2
    /dev/hda        /media/cdrom0    iso9660  ro,user   0  0
    /dev/fd0        /media/floppy0   auto     rw,user   0  0
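    Each fstab line has six whitespace-separated fields: device, mount point, filesystem type, options, dump flag, and fsck pass number. That makes entries easy to pick apart with awk; a hypothetical one-liner over one of the entries above:

```shell
# Split an fstab entry into its six fields.
echo '/dev/hdc8 /var ext3 defaults 0 2' |
  awk '{ printf "dev=%s mnt=%s type=%s opts=%s dump=%s pass=%s\n", $1, $2, $3, $4, $5, $6 }'
# prints: dev=/dev/hdc8 mnt=/var type=ext3 opts=defaults dump=0 pass=2
```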

  • fsck: check + repair fs

    Filesystem corruption sources: power failure, system crash.

    Types of corruption: Unreferenced inodes. Bad superblocks. Unused data blocks not recorded in the block maps. Data blocks listed as free that are used in files.

    fsck can fix these and more. It asks the user to make more complex decisions, and stores unfixable files in lost+found.
