Yonggang Liu University of Florida Learning the Data Management in Linux Kernel v2.6

Dec 28, 2015


Kenneth Bennett
Transcript
Page 1:

Yonggang Liu, University of Florida

Learning the Data Management in Linux Kernel v2.6

Page 2:

Layout

Picture of Today’s Topics

Virtual File System

The Page Cache

Accessing the Files

Ext2 and Ext3 file systems

Page 4:

Picture of Today’s Topics

[Figure: the layers between a process and the hard disk, listed from top to bottom]

• Virtual File System (VFS): provides a uniform file system interface to the processes.
• Disk Caches: keep the most recently accessed data in RAM. They include the page cache, the dentry cache and the inode cache.
• Mapping Layer (specific file systems such as Ext3, FAT, UFS): determines the physical location of the data on disk.
• Generic Block Layer: offers an abstract view of the block devices. Every I/O operation is a “block I/O”.
• I/O scheduler layer: groups requests for data that lie near each other on the physical medium.
• Block Device Driver: takes care of the actual data transfer by sending suitable commands to the hardware.
• Hard Disk

Page 5:

Layout

Picture of Today’s Topics

Virtual File System
• Uniform System Calls - VFS Calls
• Common File Model - VFS Objects
• Interaction between Processes and VFS Objects
• Files Associated with a Process

The Page Cache

Accessing the Files

Ext2 and Ext3 file systems

Page 6:

Uniform System Calls - VFS Calls

Virtual File System

Process 1 Process 2 Process 3

Disk-based file systems: Ext3, NTFS, ReiserFS, UDF DVD FS …

Network file systems: NFS, Coda, AFS, CIFS, NCP …

Special file systems: root FS, sysfs, tmpfs, usbfs, sockfs

VFS defines the uniform system calls: mount(), umount(), sysfs(), statfs(), chroot(), chdir(), fchdir(), getcwd(), mkdir(), rmdir(), getdents(), link(), rename(), readlink(), chown(), chmod(), stat(), open(), close(), creat(), dup(), fcntl(), select(), poll(), truncate(), lseek(), read(), write() …
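As a small illustration of the uniform interface, the sketch below uses Python's thin `os` wrappers around open(), write(), lseek(), read(), fstat() and close(). The same calls are issued no matter which file system holds the file. The helper name `roundtrip`, the file name and the use of a temporary directory are assumptions made for this example only.

```python
import os
import tempfile


def roundtrip(directory, payload=b"hello vfs"):
    """Write a file and read it back using only generic system calls."""
    path = os.path.join(directory, "demo.txt")          # hypothetical file name
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)   # open()
    try:
        os.write(fd, payload)                 # write()
        os.lseek(fd, 0, os.SEEK_SET)          # lseek() back to the start
        data = os.read(fd, len(payload))      # read()
        size = os.fstat(fd).st_size           # fstat()
    finally:
        os.close(fd)                          # close()
    os.unlink(path)
    return data, size


with tempfile.TemporaryDirectory() as tmp:
    result = roundtrip(tmp)
```

Nothing in the code names a concrete file system; the VFS dispatches each call to whatever file system backs the temporary directory.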

Page 7:

Common File Model - VFS Objects

File Object
Describes how a process interacts with a file it has opened. Created when the file is opened. Has no image on disk.
Some fields: f_dentry, f_op, f_pos, f_version, f_mapping

Dentry Object
A directory entry object associates a pathname with its inode. Copied to memory during pathname lookups.
Some fields: d_inode, d_parent, d_name, d_subdirs, d_sb

Superblock Object
Each file system has a superblock recording the information of the file system; it is copied to memory when used.
Some fields: s_blocksize, s_type, s_root, s_inodes, s_bdev

Inode Object
Includes all information needed by the file system to handle a file. Copied to memory when the file attribute is accessed.
Some fields: i_ino, i_size, i_atime, i_sb, i_mapping

Page 8:

Interaction between Processes and VFS Objects

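[Figure: Processes 1, 2 and 3 each hold a file descriptor (fd) that refers to its own file object. Each file object points through f_dentry to a dentry object in the dentry cache; two of the file objects share one dentry object. The dentry objects point through d_inode to a single inode object, which points through i_sb to the superblock object and corresponds to the disk file.]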

In the example, 3 processes have opened the same file, 2 of them using the same hard link.
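The situation in the example can be reproduced from user space. The sketch below is only an approximation: a single process opens the file three times instead of three processes doing so, and the names `open_three_times`, `data.txt` and `alias.txt` are hypothetical. Two opens go through the original name and one through a second hard link; all three end up at the same inode.

```python
import os
import tempfile


def open_three_times(directory):
    """Open one file via two hard links and report the inodes reached."""
    original = os.path.join(directory, "data.txt")    # hypothetical names
    alias = os.path.join(directory, "alias.txt")
    with open(original, "wb") as f:
        f.write(b"shared contents")
    os.link(original, alias)                  # create a second hard link

    fds = [
        os.open(original, os.O_RDONLY),       # same hard link ...
        os.open(original, os.O_RDONLY),       # ... used twice
        os.open(alias, os.O_RDONLY),          # the other hard link
    ]
    try:
        inode_numbers = {os.fstat(fd).st_ino for fd in fds}
        link_count = os.fstat(fds[0]).st_nlink
    finally:
        for fd in fds:
            os.close(fd)
    return inode_numbers, link_count


with tempfile.TemporaryDirectory() as tmp:
    inodes, links = open_three_times(tmp)
```

Each `os.open()` creates a separate file object, yet `inodes` contains a single inode number and the link count is 2, matching the two dentry objects and one inode object in the figure.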

Page 9:

Files Associated with a Process

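[Figure: the process descriptor and the structures that describe its files]

Process Descriptor fields:
• files → files_struct: stores which files are currently opened by the process. Its fd_array is indexed by file descriptor (0 = stdin, 1 = stdout, 2 = stderr, 3, …), and each used entry points to a file object.
• fs → fs_struct: stores the current working directory (pwd), the process’s own root directory (root), etc.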
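The difference between a file descriptor (a slot in fd_array) and the file object it points to can be observed from user space. In the hedged sketch below, `os.dup()` yields a second descriptor for the same file object, so the file position is shared, while a second `os.open()` creates a new file object with its own position. The function name `offsets_after_write` and the file name are assumptions for this example.

```python
import os
import tempfile


def offsets_after_write(directory):
    """Compare file positions seen through a dup'ed fd and a fresh open."""
    path = os.path.join(directory, "offsets.txt")       # hypothetical name
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    dup_fd = os.dup(fd)                      # new descriptor, same file object
    other_fd = os.open(path, os.O_RDONLY)    # new descriptor, new file object
    try:
        os.write(fd, b"12345")               # advances the position by 5
        shared = os.lseek(dup_fd, 0, os.SEEK_CUR)
        independent = os.lseek(other_fd, 0, os.SEEK_CUR)
    finally:
        for descriptor in (fd, dup_fd, other_fd):
            os.close(descriptor)
    return shared, independent


with tempfile.TemporaryDirectory() as tmp:
    shared_pos, independent_pos = offsets_after_write(tmp)
```

The duplicated descriptor reports position 5 because the position lives in the file object (the f_pos field), while the independently opened descriptor still reports 0.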

Page 10:

Layout

Picture of Today’s Topics

Virtual File System

The Page Cache
• Three Kinds of Disk Cache
• Page descriptors
• Find a Page in Page Cache
• Typical Layout of a Page
• Buffer Pages

Accessing the Files

Ext2 and Ext3 file systems

Page 11:

Three Kinds of Disk Cache

Page Cache
The main disk cache used by the Linux kernel. Stores the pages containing:
• Data of regular files
• Directories
• Data directly read from block device files
• Data of User Mode processes swapped out on disk
• Special file systems (e.g., tmpfs)

Dentry Cache
Stores dentry objects representing file system pathnames.

Inode Cache
Stores inode objects representing disk inodes.

Page 12:

Page descriptors

• Page descriptors are used by the kernel to keep track of the status of each page frame.
• Size: 32 bytes.
• All page descriptors are stored in the mem_map array, which takes about 1% of RAM.

[Figure: A View of Memory Address Space (Abbreviated), showing a region reserved for the kernel, the mem_map array, the pages, and a region labeled “Reserved (HD)”]
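The "about 1%" figure follows from simple arithmetic. The sketch below works it out under the assumption of 4 KB page frames (the slide does not state the page size); the function name `mem_map_overhead` is hypothetical.

```python
PAGE_DESCRIPTOR_SIZE = 32   # bytes per page descriptor, from the slide
PAGE_SIZE = 4096            # bytes per page frame; an assumption (typical on x86)


def mem_map_overhead(ram_bytes):
    """Return (size of mem_map in bytes, fraction of RAM it occupies)."""
    page_frames = ram_bytes // PAGE_SIZE
    mem_map_bytes = page_frames * PAGE_DESCRIPTOR_SIZE
    return mem_map_bytes, mem_map_bytes / ram_bytes


# 1 GiB of RAM: 262144 page frames, each needing one 32-byte descriptor
size, fraction = mem_map_overhead(1 << 30)
```

With these numbers mem_map occupies 8 MiB, i.e. 32/4096 ≈ 0.78% of RAM, which the slide rounds to 1%.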

Page 13:

Find a Page in Page Cache

• Each inode object owns an address_space object, which has a pointer to a radix tree.
• The radix tree is used to look up a page in the page cache.
– An offset in the file leads to the position of a page descriptor in the radix tree.

[Figure: the i_mapping field of the inode object points to the address_space object, whose page_tree field points to the root of the radix tree; the root leads through intermediate nodes to the page descriptors.]
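The lookup idea can be sketched with a toy radix tree. This is not the kernel's implementation: the class `ToyRadixTree`, the fixed height of 2, the 64 slots per node and the 4 KB page size are all assumptions chosen to keep the example short. The page index derived from a file offset is cut into 6-bit pieces, and each piece selects a slot at one level of the tree.

```python
PAGE_SIZE = 4096              # assumed page size
MAP_SHIFT = 6                 # assumed: 2**6 = 64 slots per node
MAP_SIZE = 1 << MAP_SHIFT
MAP_MASK = MAP_SIZE - 1


def page_index(file_offset):
    """Index of the page that contains the given byte offset."""
    return file_offset // PAGE_SIZE


class ToyRadixTree:
    """Fixed-height radix tree mapping page indexes to page descriptors."""

    def __init__(self, height=2):
        self.height = height
        self.root = [None] * MAP_SIZE

    def _slots(self, index):
        # Slot numbers from the root level down to the leaf level.
        return [(index >> (MAP_SHIFT * level)) & MAP_MASK
                for level in range(self.height - 1, -1, -1)]

    def insert(self, index, descriptor):
        if not 0 <= index < 1 << (MAP_SHIFT * self.height):
            raise IndexError("index does not fit in this toy tree")
        *inner, last = self._slots(index)
        node = self.root
        for slot in inner:
            if node[slot] is None:            # create missing nodes on the way
                node[slot] = [None] * MAP_SIZE
            node = node[slot]
        node[last] = descriptor

    def lookup(self, index):
        if not 0 <= index < 1 << (MAP_SHIFT * self.height):
            return None
        node = self.root
        for slot in self._slots(index):
            node = node[slot]
            if node is None:                  # page is not in the cache
                return None
        return node


tree = ToyRadixTree()
tree.insert(page_index(70000), "descriptor-17")   # offset 70000 is in page 17
```

A lookup that returns `None` corresponds to a page that is not in the page cache and must be read from disk.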

Page 14:

Typical Layout of a Page

[Figure: a page divided into segments, each segment into blocks, and each block into sectors]

• Sector (typically 512 B): the smallest unit of data when accessing the block device.

• Block (a multiple of the sector size; must be a power of 2 and no larger than a page frame): the smallest unit of data transfer for the VFS and the file systems. It corresponds to one or more adjacent sectors.

• Segment (a multiple of the block size): if some blocks in a page hold data that is adjacent on disk, they belong to one segment. Segments are used because each block I/O operation transfers a group of adjacent blocks on disk.
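The relationship between the three units can be made concrete with a short sketch. The 4 KB page size and both function names are assumptions for illustration; the grouping function simply merges blocks whose disk block numbers are consecutive, which is the property that defines a segment above.

```python
PAGE_SIZE = 4096     # assumed page size
SECTOR_SIZE = 512    # typical sector size, from the slide


def blocks_per_page(block_size):
    """Validate a block size against the rules above and count blocks per page."""
    is_power_of_two = block_size > 0 and block_size & (block_size - 1) == 0
    if block_size % SECTOR_SIZE or not is_power_of_two or block_size > PAGE_SIZE:
        raise ValueError("invalid block size")
    return PAGE_SIZE // block_size


def group_into_segments(disk_blocks):
    """Group the disk block numbers of a page into runs of adjacent blocks."""
    segments = []
    for block in disk_blocks:
        if segments and block == segments[-1][-1] + 1:
            segments[-1].append(block)    # adjacent on disk: same segment
        else:
            segments.append([block])      # gap on disk: start a new segment
    return segments


# A page of four 1 KB blocks, three of them adjacent on disk
example_segments = group_into_segments([100, 101, 102, 200])
```

Here the page needs two segments: one covering disk blocks 100 to 102 and one covering block 200.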

Page 15:

Buffer Pages

[Figure: a page holding four buffers (blocks), together with its page descriptor and four buffer heads, one per buffer]

Buffer pages are used to address individual disk blocks within a page.
Buffer page = a regular page + several buffer heads.
Buffer pages are created only when necessary. Two common cases:
• When reading or writing pages of a file that are not stored in contiguous disk blocks.
• When accessing a single disk block (e.g., a superblock or an inode).

Page 16:

Layout

Picture of Today’s Topics

Virtual File System

The Page Cache

Accessing the Files
• I/O Modes
• Reading a File
• Read-ahead
• Read-ahead Considerations
• Writing to a File
• When to Flush Dirty Pages
• Process of Flushing Dirty Pages

Ext2 and Ext3 file systems

Page 17:

I/O Modes

Canonical Mode
O_SYNC and O_DIRECT are cleared. read() is blocking; write() terminates as soon as the data is copied to the page cache.

Synchronous Mode
O_SYNC is set. The flag affects only the write operation, which blocks the calling process until the data is effectively written to disk.

Memory Mapping Mode
The application issues an mmap() system call to map the file to memory, so the file appears as an array of bytes in RAM.

Direct I/O Mode
O_DIRECT is set. Any read or write operation transfers data directly from the User Mode address space to disk, or vice versa, bypassing the page cache.

Asynchronous Mode
The requests for data never block the calling process; rather, they are carried out “in the background” while the application continues its normal execution.
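Memory mapping mode is easy to try from user space. The sketch below uses Python's standard `mmap` module, which wraps the mmap() system call; the function name `uppercase_first_word`, the file name and the file contents are assumptions for this example. The mapped file is read and modified with ordinary slice operations, as if it were an array of bytes.

```python
import mmap
import os
import tempfile


def uppercase_first_word(path):
    """Modify the first five bytes of a file through a memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as view:    # length 0 maps the whole file
            first = bytes(view[:5])               # read it like a byte array
            view[:5] = first.upper()              # in-place, same length
            view.flush()                          # push the change to the file
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "mapped.txt")   # hypothetical file name
    with open(demo_path, "wb") as f:
        f.write(b"hello page cache")
    mapped_result = uppercase_first_word(demo_path)
```

No read() or write() call touches the data inside `uppercase_first_word`; the change reaches the file through the mapping.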

Page 18:

Reading a File

1. Get the address_space object and the inode object.
2. Derive the page’s logical index and offset; save them locally.
3. Start the following cycle to read all requested pages:
• Read ahead if necessary.
• Find the page descriptor from the address_space object.
• If the page descriptor is NULL, allocate a new page and start the I/O to read the page from disk.
• Copy the page from the page cache to the User Mode buffer.
• Update the index and offset variables.
• Continue if there are more pages to read.
4. All data has been read. Increment the file pointer.
5. Update atime in the inode.
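The cycle above can be mimicked with a toy model. This is a user-space simulation, not kernel code: the class `ToyFile`, the dictionary used as the page cache, the `disk_reads` counter and the 4 KB page size are all assumptions, and read-ahead and the atime update are left out.

```python
PAGE_SIZE = 4096   # assumed page size


class ToyFile:
    """Simulates read() going through a per-file page cache."""

    def __init__(self, disk_data):
        self.disk = disk_data     # stands in for the blocks on disk
        self.page_cache = {}      # page index -> page contents
        self.pos = 0              # file pointer
        self.disk_reads = 0       # how many pages had to be fetched

    def _get_page(self, index):
        page = self.page_cache.get(index)
        if page is None:                          # not cached: "read from disk"
            start = index * PAGE_SIZE
            page = self.disk[start:start + PAGE_SIZE]
            self.page_cache[index] = page
            self.disk_reads += 1
        return page

    def read(self, count):
        index, offset = divmod(self.pos, PAGE_SIZE)   # logical index and offset
        out = bytearray()
        while count > 0:
            page = self._get_page(index)
            chunk = page[offset:offset + count]
            if not chunk:                         # reached end of file
                break
            out += chunk                          # copy to the "user buffer"
            count -= len(chunk)
            offset += len(chunk)
            if offset == PAGE_SIZE:               # move on to the next page
                index, offset = index + 1, 0
        self.pos += len(out)                      # increment the file pointer
        return bytes(out)


data = bytes(range(256)) * 40                     # 10240 bytes of sample data
toy = ToyFile(data)
toy.pos = 4000
first_read = toy.read(200)                        # spans pages 0 and 1
reads_after_first = toy.disk_reads
second_read = toy.read(100)                       # served from cached page 1
```

The second read causes no additional disk access because the page it needs is already in the cache.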

Page 19:

Read-ahead

Read-ahead consists of reading several adjacent pages of data before they are actually requested. In most cases, read-ahead significantly enhances disk performance.

To tune the read-ahead size for an opened file, modify the ra_pages field of the file->f_ra object.

• POSIX_FADV_NORMAL: 32 pages (default)
• POSIX_FADV_SEQUENTIAL: 2 × NORMAL
• POSIX_FADV_RANDOM: 0 pages
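The POSIX_FADV_* values listed above are the advice arguments of the posix_fadvise() system call, which Python exposes as `os.posix_fadvise()` on platforms that provide it. The sketch below gives the sequential hint before reading a file; the function name `read_with_sequential_hint` and the sample file are assumptions, and the hint is skipped where the call is unavailable or rejected.

```python
import os
import tempfile


def read_with_sequential_hint(path):
    """Read a whole file after advising the kernel of sequential access."""
    fd = os.open(path, os.O_RDONLY)
    try:
        hinted = False
        if hasattr(os, "posix_fadvise"):          # not available everywhere
            try:
                # offset 0 and length 0 cover the file from start to end
                os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
                hinted = True
            except OSError:
                pass                              # advice is optional
        chunks = []
        while True:
            chunk = os.read(fd, 65536)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return hinted, b"".join(chunks)


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sequential.bin")   # hypothetical name
    sample = b"abc" * 50000
    with open(sample_path, "wb") as f:
        f.write(sample)
    was_hinted, contents = read_with_sequential_hint(sample_path)
```

The advice changes only how aggressively the kernel reads ahead; the data returned is the same with or without it.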

Page 20:

Read-ahead Considerations

Read-ahead may be gradually increased as long as the process keeps accessing the file sequentially.

Read-ahead must be scaled down or even disabled when the current access is not sequential with respect to the previous one (random access).

Read-ahead should be stopped when a process keeps accessing the same page, or when almost all pages of the file are already in the page cache.

Page 21:

Writing to a File

1. Find the inode object of the file.
2. If O_APPEND is set, move the file pointer to the end.
3. Update mtime and ctime in the inode; mark the inode as dirty.
4. Start the following cycle to update all the pages involved:
• Search for the page in the page cache. If the page is not in the page cache, allocate a new page frame.
• Allocate and initialize the buffer heads for the page.
• Copy the characters from the User Mode buffer to the page.
• Mark the underlying buffers as dirty so they can be written to disk later.
• Check whether the ratio of dirty pages in the page cache has risen above vm.dirty_ratio (typically 40%); if so, flush a few tens of pages to disk.
5. When all pages involved have been handled, update the file pointer.
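The write cycle can be sketched as a toy simulation. Everything here is an assumption made for illustration: the class `ToyPageCache`, the 4 KB page size, the way the dirty ratio is computed against a fixed number of pages, and the simplification that a flush writes out every dirty page rather than a few tens of them. Buffer heads and inode timestamps are not modeled.

```python
PAGE_SIZE = 4096     # assumed page size
DIRTY_RATIO = 0.40   # the 40% threshold quoted above


class ToyPageCache:
    """Simulates writes landing in the page cache and being flushed later."""

    def __init__(self, total_pages):
        self.total_pages = total_pages   # size of the simulated page cache
        self.pages = {}                  # page index -> bytearray
        self.dirty = set()               # indexes of dirty pages
        self.flushed = []                # pages "written to disk", in order

    def write(self, pos, data):
        """Copy data into the cache starting at pos; return the new position."""
        while data:
            index, offset = divmod(pos, PAGE_SIZE)
            page = self.pages.setdefault(index, bytearray(PAGE_SIZE))
            chunk = data[:PAGE_SIZE - offset]     # what fits in this page
            page[offset:offset + len(chunk)] = chunk
            self.dirty.add(index)                 # mark dirty, write later
            if len(self.dirty) / self.total_pages > DIRTY_RATIO:
                self.flush()
            pos += len(chunk)
            data = data[len(chunk):]
        return pos

    def flush(self):
        # Simplification: write out every dirty page at once.
        self.flushed.extend(sorted(self.dirty))
        self.dirty.clear()


cache = ToyPageCache(total_pages=10)
new_pos = cache.write(4090, b"x" * 10)            # touches pages 0 and 1
dirty_after_small_write = set(cache.dirty)
cache.write(5 * PAGE_SIZE, b"y" * (3 * PAGE_SIZE))   # dirties pages 5, 6, 7
```

The small write returns immediately with two dirty pages; the larger one pushes the dirty fraction to 50%, which exceeds the threshold and triggers a flush.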

Page 22:

When to Flush Dirty Pages

The pdflush kernel thread is responsible for writing out dirty pages in the background. Each time, pdflush tries to flush 1024 dirty pages.

A pdflush thread is woken up when:
• The User Mode process issues a sync() system call.
• The kernel fails to allocate a new buffer page or memory pool element.
• The page reclaiming algorithm (LRU) wants to free more memory.
• A process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_background_ratio (typically 10%).

A process itself may invoke a system call to write back a few tens of pages when:
• The process modifies a page in the page cache and causes the fraction of dirty pages to rise above vm.dirty_ratio (typically 40%).
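The two thresholds can be summarized in a small decision function. This is only a sketch of the rule stated above: the function name `writeback_action` and the returned strings are hypothetical, and the 10% and 40% values are the typical defaults quoted on the slide, not values read from a running system.

```python
DIRTY_BACKGROUND_RATIO = 0.10   # typical vm.dirty_background_ratio
DIRTY_RATIO = 0.40              # typical vm.dirty_ratio


def writeback_action(dirty_pages, total_pages):
    """Say what a page modification triggers at the given dirty fraction."""
    fraction = dirty_pages / total_pages
    if fraction > DIRTY_RATIO:
        return "writing process flushes pages itself"
    if fraction > DIRTY_BACKGROUND_RATIO:
        return "wake up pdflush"
    return "nothing to do"
```

For example, with 100 pages in total, 5 dirty pages trigger nothing, 20 wake up pdflush, and 50 make the writing process flush pages itself.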

Page 23:

Process of Flushing Dirty Pages

For each dirty inode in each superblock, do:
• If the request queue is write-congested and the process does not want to block, terminate.
• Find the file’s initial page to be considered.
• Look up the descriptors of the dirty pages in the radix tree.
• For each page descriptor found in the previous step, flush the page to disk right away or record it (this depends on the file system).
• Start the disk I/O if the “record” method was used in the previous step.
• If the inode is dirty, write it back.

Continue until the specified number of pages has been flushed.

Page 24:

Layout

Picture of Today’s Topics

Virtual File System

The Page Cache

Accessing the Files

Ext2 and Ext3 File Systems
• Ext2 Block Groups
• Data Blocks Addressing
• Allocating a Data Block
• The Ext3 Journaling File System

Page 25:

Ext2 Block Groups

Disk layout: Boot Block | Block group 0 | … | Block group n

Layout of each block group:
• Super Block: 1 block
• Group Descriptors: n blocks
• Data block Bitmap: 1 block
• inode Bitmap: 1 block
• inode Table: n blocks
• Data blocks: n blocks

• The Ext2 file system partitions the disk blocks into block groups of the same size.

• The maximum number of blocks in a block group is 8 × b, where b is the block size in bytes, because the data block bitmap (one bit per block) must fit in one block.

• The kernel tries to keep the data blocks belonging to a file in the same block group, if possible.
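The 8 × b limit is easy to work out for concrete block sizes. In the sketch below, the function names are hypothetical and the block sizes are just sample values.

```python
def max_blocks_per_group(block_size):
    """One bitmap block holds 8 bits per byte, one bit per block in the group."""
    return 8 * block_size


def max_group_bytes(block_size):
    """Largest amount of disk space a single block group can cover."""
    return max_blocks_per_group(block_size) * block_size


small = (max_blocks_per_group(1024), max_group_bytes(1024))
large = (max_blocks_per_group(4096), max_group_bytes(4096))
```

With 1 KB blocks a group holds at most 8192 blocks (8 MiB); with 4 KB blocks it holds at most 32768 blocks (128 MiB).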

Page 26:

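Data Blocks Addressing

Given an offset f inside a file, how do we derive the logical block number of the corresponding block on disk?
1. Get the file block number by dividing f by the block size.
2. Translate the file block number to the corresponding logical block number by “Data Blocks Addressing”.

[Figure: “Address Mapping Table”. The i_block array in the inode has 15 entries, numbered 0 to 14; b is the block size in bytes, and blocks are numbered with their file block number.]

• Entries 0 to 11 (direct addressing): file blocks $0$ to $11$
• Entry 12 (1 level of indirection): file blocks $12$ to $b/4 + 11$
• Entry 13 (2 levels of indirection): file blocks $b/4 + 12$ to $(b/4)^2 + (b/4) + 11$
• Entry 14 (3 levels of indirection): file blocks $(b/4)^2 + (b/4) + 12$ to $(b/4)^3 + (b/4)^2 + (b/4) + 11$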
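The two steps can be sketched in code. The sketch assumes 4-byte block numbers, so that one block holds b/4 of them, which is where the b/4 terms in the address mapping table come from; the function names `file_block_of` and `locate` are hypothetical. `locate` returns the i_block entry to start from and the indexes to follow inside each indirect block.

```python
def file_block_of(offset, block_size):
    """Step 1: file block number that contains byte offset f."""
    return offset // block_size


def locate(file_block, block_size):
    """Step 2: (i_block entry, indexes to follow in the indirect blocks)."""
    n = block_size // 4                  # block numbers per indirect block
    if file_block < 0:
        raise ValueError("negative file block number")
    if file_block < 12:
        return file_block, []            # direct addressing
    rel = file_block - 12
    if rel < n:
        return 12, [rel]                 # one level of indirection
    rel -= n
    if rel < n * n:
        return 13, [rel // n, rel % n]   # two levels
    rel -= n * n
    if rel < n ** 3:
        return 14, [rel // (n * n), (rel // n) % n, rel % n]   # three levels
    raise ValueError("file block number beyond triple indirection")
```

For a 1 KB block size (b/4 = 256), file block 267 is the last one reachable through entry 12 and file block 268 is the first one that needs entry 13, in agreement with the ranges b/4 + 11 and b/4 + 12.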

Page 27:

Allocating a Data Block

To reduce file fragmentation, when allocating a block for a file, Ext2 follows this order:
1. Get a new block for a file near the block already allocated for the file.
2. Get a new block in the block group that includes the file’s inode.
3. Get a new block from one of the other block groups.

Preallocation of data blocks
To reduce file fragmentation, each time the file does not get only the requested block, but rather 8 adjacent blocks. When the file is closed, all the unused preallocated blocks are freed.

Page 28:

The Ext3 Journaling File System

Goal of journaling file systems
When doing a consistency check, the file system only needs to look at the journal, the part of the disk that contains the most recent disk write operations, instead of checking the whole file system. This saves a large amount of time after a system failure.

Two Steps in the Ext3 Journaling Process
1. A copy of the blocks to be written is stored in the journal. If the system fails during this step: discard the changes; the file system is still consistent.
2. When the I/O data transfer to the journal is completed, the blocks are written to the file system. When this is finished, the copies in the journal are discarded. If the system fails during this step: apply the changes from the journal; the file system is consistent.
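The two steps and the two failure cases can be played through with a toy model. This is a simulation of the idea only, not of Ext3's on-disk journal format: the class `ToyJournaledDisk`, the `committed` flag that marks a completed transfer to the journal, and the `crash_before_commit` parameter are all assumptions made for this example.

```python
class ToyJournaledDisk:
    """Toy model of write-ahead journaling with crash recovery."""

    def __init__(self):
        self.blocks = {}         # the file system proper: block number -> data
        self.journal = []        # copies of the blocks about to be written
        self.committed = False   # True once the journal transfer is complete

    def log(self, updates, crash_before_commit=False):
        """Step 1: store copies of the blocks in the journal."""
        self.journal = list(updates.items())
        if not crash_before_commit:
            self.committed = True

    def checkpoint(self):
        """Step 2: write the blocks to the file system, then drop the copies."""
        for block_no, data in self.journal:
            self.blocks[block_no] = data
        self.journal = []
        self.committed = False

    def recover(self):
        """After a failure, look only at the journal."""
        if self.committed:
            self.checkpoint()    # failure during step 2: apply the changes
            return "applied"
        self.journal = []        # failure during step 1: discard the changes
        return "discarded"


disk = ToyJournaledDisk()
disk.blocks = {1: b"old"}

disk.log({1: b"new"}, crash_before_commit=True)   # crash during step 1
first_outcome = disk.recover()
after_first = disk.blocks[1]

disk.log({1: b"new"})                             # crash right after step 1
second_outcome = disk.recover()
after_second = disk.blocks[1]
```

In both cases recovery inspects only the journal, and the file system ends up either entirely in the old state or entirely in the new one.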

Page 29:

Obrigado!

Thank you!

谢谢!