RozoFS Architecture Overview: RozoFS components edition 1.4 23/01/2015
Jan 13, 2016
RozoFS architecture overview: Components

[Diagram: on the client node, Rozofsmount exposes the mount point /fs1/home/. It follows the control path to the Exportd (the metadata server) for metadata, and the data path to the Storage nodes (Sid1 on host1, ...) for user data.]
© This document is proprietary and confidential. No part of this document may be disclosed in any manner to a third party without the prior written consent of FIZIANS SAS. 3
Storage component
[Diagram: a storage node runs one storage process per (cid, sid) pair ([cid1,sid1], [cid2,sid1], ... [cidn,sid1]), reachable at an IP address and port. Each device (Device 0 ... Device n) is a local file system (e.g. XFS), optionally on a RAID array (0, 0+1, 5, 6) built from the physical disks.]

• A storage (cid/sid) is a set of logical disks (devices) with the same capacity and performance.
• On the same server, RozoFS can provide storages based on different technologies.
Note: the configuration can be done with or without a RAID controller.
RozoFS clusters and Volumes
[Diagram: storage hosts host_1 ... host_n each contribute to Cluster 1, Cluster 2, ... Cluster n. Volume 1 is the set of these clusters, each cluster listing its storages: Sid1:host_1 ... Sidn:host_n.]

• A RozoFS cluster (cid) is a uniform set of storages (sid) in terms of disk capacity and performance.
• A cluster id is unique within a RozoFS system.
Mapping filesystems on volumes
[Diagram: Volume 1 (Cluster 1 ... Cluster n) hosts Filesystem 1 ... Filesystem j; Volume 2 (Cluster n+1 ... Cluster n+p) hosts Filesystem j+1 ... Filesystem j+k.]

• RozoFS supports configurations with multiple volumes.
• A volume can host more than one file system (thin provisioning).
• There are quotas (hard and soft) per file system.
• A file system is identified by a unique id (eid) within the configuration.
File localization within a filesystem
[Diagram: a file of Filesystem 1 ... Filesystem j on Volume 1 is turned by the Mojette transform into projections, which are spread over the storages (cid/sid) of the storage nodes.]
RozoFS configuration
[Diagram: RozoFS configuration, per node type.
- Exportd node: the export configuration maps each filesystem to a volume (Eid1:/metadata/fs1, vid=1) and describes each volume (Volume 1 ... Volume i) as a list of clusters (Cluster 1 ... Cluster n) with their storages (Sid1:host1, Sid2:host2, Sid3:host3, Sid4:host4).
- Storage node: storage_conf holds the listening endpoints (@IP:port) and, per [cid,sid], a root pathname and a device count, e.g. [cid1,sid1]:pathname1,device_count.
- Rozofsmount node: an fstab entry of the form
rozofsmount mount_path rozofs export@IP,/metadata/fs1
On the client node, Rozofsmount (/fs1/home/) follows the control path to the Exportd and the data path to the storages Sid1:host1 ... Sid4:host4.]
Typical RozoFS deployments
RozoFS native mode (scale-out NAS)
[Diagram: Linux clients/applications running Rozofsmount speak the native protocol over a GigE infrastructure (shared by data storage and metadata) to the storage-and-metadata tier: the Storage nodes and the Exportd.]
Note: the exportd function can also reside on some of the storage nodes.
RozoFS cluster: NAS mode

[Diagram: Windows, Linux, UNIX and Apple clients reach the cluster over a GigE infrastructure using SMB, NFS, AFP, etc. The clients/applications tier runs Rozofsmount on each gateway node; a second GigE infrastructure (data storage and metadata) connects them to the Storage nodes and the Exportd.]
Virtualisation solution with RozoFS: CloudStack+KVM
[Diagram: each CloudStack + KVM hypervisor node runs Rozofsmount together with a Storage process, interconnected by a GigE infrastructure (data storage and metadata). The clients/applications tier reaches the platform through an external network over a standard GigE infrastructure; the Exportd completes the cluster.]
RozoFS basic exchanges
RozoFS basic exchanges: inter-component interfaces

[Diagram: on the client node, Rozofsmount drives one or more storclis (Storcli 1 ... Storcli n). Rozofsmount exchanges metadata operations/mount and the cluster configuration with the Metadata Server, which also performs storage monitoring and projection deletion towards the storages (Sid1:host1 ...). The storclis carry the read/write/truncate traffic to the storage nodes.]
RozoFS basic exchanges: filesystem mounting

[Diagram: the client runs
rozofsmount –H exportd_host –E /metadata/fs1 /fs1/home/
(1) Rozofsmount sends "mount /metadata/fs1" to the Exportd. (2) The Exportd resolves Eid1:/metadata/fs1 (vid=1) against its export configuration (Volume 1 ... Volume i, each listing its clusters and sids). (3) It returns the clusters list. (4) Storcli 1 opens TCP connections to the storages Sid1:host1 ... Sid4:host4.]
RozoFS basic exchanges: file creation

[Diagram: the application/VFS calls open(« /fs1/home/foo », O_CREAT|O_RDWR, 0640). (1) Rozofsmount sends mknod(EID, parent_fid, « foo », O_RDWR, 0640) to the Metadata Server (exportd). (2) Export_mknod then: allocates a unique file id (FID); runs the volume distribution for the EID (get the volume (vid) associated with the EID, get the cluster list (cid), get 4 storages of a cluster (sid)); inserts (FID, « foo ») in the parent directory; writes the new file attributes to disk; updates the parent attributes. (3) The exportd answers attrs(FID, cid1:{sid1..sid4}, 0640, etc.). (4) Rozofsmount returns a file descriptor to the application.
FID: unique file identifier. Parent_fid: FID of the parent directory.]
RozoFS basic exchanges: file opening

[Diagram: the application calls open(« /fs1/home/foo », O_RDWR, 0640) through the VFS. (1) Rozofsmount sends lookup(EID, parent_fid, « foo », O_RDWR, 0640) to the Metadata Server (exportd). (2) Export_lookup gets the file FID from the parent directory, via the directory-entries cache (name to FID, e.g. foo -> FID3) or disk, then gets the file attributes from the attributes cache (FID -> attrs) or disk. (3) The exportd returns File_attributes(attrs3): FID3, cid:{sid1,sid2,sid3,sid4}, atime, mtime, etc. (4) Rozofsmount stores attrs3 in its own attributes cache, allocates a file descriptor (fd 1) bound to those attributes, and returns it to the application.]
RozoFS basic exchanges: synchronous file write

[Diagram: the application/VFS calls len = pwrite(fd, offset, size, buffer). (1) Rozofsmount finds the context associated with fd1 (FID3, cid:{sid1..sid4}, file size, atime, mtime, etc.) and (2) submits write(FID3, offset, data, size) to the storcli, then waits for the end of the write. (3, 4) The storcli runs the Mojette forward transform to generate the projections (prj1, prj2, prj3), (5) sends all the projection writes in parallel (write(FID3, prj1) to Sid1, write(FID3, prj2) to Sid2, ...) and (6) waits for all the write responses. (7) The storcli returns the written size (or an error code). (8) Rozofsmount then updates the blocks on the exportd with Wr_blks(EID1, FID3, offset, size); Write_blocks updates the time information, updates the size if greater, and updates the attributes cache and disk. (9) The exportd returns the attributes, and (10) the written length is returned to the application.
Redundancy level (2+1): 2 reads, 3 writes.]
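The 2+1 property above (3 projections written, any 2 sufficient to read) can be illustrated with a toy XOR parity code. This is NOT the Mojette transform that RozoFS actually uses; it is only a minimal sketch of the "2 reads, 3 writes" redundancy:

```python
def encode_2plus1(block):
    """Toy 2+1 erasure code: split a block into two halves and add an
    XOR parity projection (NOT the real Mojette transform)."""
    assert len(block) % 2 == 0
    half = len(block) // 2
    a, b = block[:half], block[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]              # 3 writes

def decode_2plus1(projections):
    """Rebuild the block from any 2 of the 3 projections (missing = None)."""
    a, b, p = projections
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, p))   # a = b XOR parity
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, p))   # b = a XOR parity
    return a + b                        # 2 reads are enough
```

For instance, if the storage holding the second projection is down, decode_2plus1([a, None, p]) still returns the original block.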
RozoFS basic exchanges: file read

[Diagram: the application/VFS calls len = pread(fd, offset, size, buffer). (1) Rozofsmount finds the context associated with fd1 (FID3, cid:{sid1..sid4}, etc.) and (2) requests the data from the storcli with Read(FID3, offset, size). (3, 4) The storcli sends parallel read requests (Read(FID3, prj1, offset_prj) to Sid1, Read(FID3, prj2, offset_prj) to Sid2) and (5) waits for the projection data returned by the storages. (6) The Mojette inverse transform rebuilds the initial block from prj1 and prj2, and (7) the storcli returns the data and length to Rozofsmount, which (8) returns them to the VFS.
Redundancy level (2+1): 2 reads, 3 writes.]
RozoFS basic exchanges: file deletion

[Diagram: the application/VFS calls unlink(« /fs1/home/foo »). (1) Rozofsmount sends unlink(EID, parent_fid, « foo ») to the Metadata Server (exportd). (2) File deletion: remove the file from the parent directory (disk and cache); delete the attributes of the file (disk and cache); update the parent attributes; insert the file reference (FID3, cid:{sid1..sid4}) in the trash (list and disk). (3) The exportd returns the parent attributes (or an error code) and (4) Rozofsmount completes the unlink. Asynchronously, a trash thread walks the trash list and requests the projection deletions for FID3 from the storages Sid1:host1 ... Sid4:host4.]
Storaged
Storage node components
Storaged processes
Storaged node start-up sequence
Multi-device feature
  Introduction
  Extendable storages
  Faster rebuild process
  Spreading files among devices
  Projection file structure
  Fault detection
Configuration file
Storage node components
[Diagram: on a storage node, rozolauncher starts the storaged process, which reads the configuration file, listens on a TCP endpoint and controls the storio. rozolauncher also starts the storio, whose disk request dispatcher feeds disk threads 1 ... 16 accessing the local file systems (e.g. XFS); the storio listens on its own TCP endpoints.]
Storaged processes
• Storaged
  - Provides the exportd with disk space information (volume balance)
  - Provides the storcli process with the listening ports of the storio
  - Takes care of the projection files deletion
  - Controls the storio processes
• Storio: the storio software is split into 2 process types:
  - a main process handling the TCP connections from the clients, receiving and decoding the requests and posting them in a queue;
  - several disk threads reading from the queue the requests posted by the main thread, processing them and sending a response back to the main thread.
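The main-process / disk-thread split above is a classic producer-consumer pattern. A minimal sketch (the request contents and response format are placeholders, not the real storio messages):

```python
import queue
import threading

def storio_sketch(requests, disk_thread_count=4):
    """Main thread posts decoded requests to a queue; disk threads pull
    them, process them (stand-in for the disk I/O) and post responses
    back, mirroring the storio organisation described above."""
    req_q, rsp_q = queue.Queue(), queue.Queue()

    def disk_thread():
        while True:
            req = req_q.get()
            if req is None:           # shutdown marker
                break
            rsp_q.put(("done", req))  # stand-in for the actual disk access

    threads = [threading.Thread(target=disk_thread) for _ in range(disk_thread_count)]
    for t in threads:
        t.start()
    for r in requests:                # "main process": decode and post
        req_q.put(r)
    for _ in threads:                 # drain and stop the pool
        req_q.put(None)
    for t in threads:
        t.join()
    return [rsp_q.get() for _ in requests]
```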
Storaged start-up sequence
[Diagram: storaged start-up sequence.
The storaged process is started under rozolauncher:
rozolauncher start /var/run/launcher_storaged_<hostname>.pid storaged -c <config_file> -H <hostname>
Each storio process (1 ... n) is in turn started under its own rozolauncher:
rozolauncher start /var/run/launcher_storio_slave_<hostname>_<storio_id>.pid storaged -i <storio_id> -c <config_file>
Each rozolauncher supervises its child process and handles its exit().]
Storage node: Multi-device feature
Introduction

• A device can be either:
  - a physical hard drive, or
  - a logical volume made of one or several hard drives managed by hardware and/or software (e.g. an LVM volume, RAID 0 behind a controller, ...).
• Formerly a storage had a root path per cid/sid tuple where to store the data files. Now the storage has N devices, numbered from 0 to N-1, that are mounted as directories under the root path to provide access to the devices. It is up to the storage to decide which device to use for each data file.
• Multi-device goals:
  - Provide a scale-up capability
  - Get rid of RAID 5/6 to avoid their weaknesses when a large number of disks are grouped
  - Provide a faster rebuild time by limiting the number of hard drives per local filesystem
  - Provide the capability to shrink a cluster
Storage node: Multi-device feature
Extendable storages

• Before the multi-device feature, a storio had a root path that used to be a logical volume made of a bunch of hard drives in RAID 5 or RAID 6:
  - It is not possible to extend a RAID 5 or 6 array handled by a hardware RAID controller.
  - When there was no space left under the root path, it was not possible to add disk space to this cluster id/storage id.
  - When adding disks to the server, one had to create a new RozoFS cluster.
• With the multi-device feature the storio handles several devices under its root path, and one can add a device to the storio.
Storage node: Multi-device feature
Faster disk rebuild process
• Before the multi-device feature, when the RAID array failed, all the data of the storage had to be rebuilt. For instance with an array of 12 disks of 4 TB in RAID 6, when 3 disks have failed, the 10 x 4 TB of data have to be rebuilt.
• With the multi-device feature, one can instead group the disks in 6 RAID 0 arrays of 2 disks. When one disk fails, the data on the other devices is still available: only one device is lost and has to be rebuilt, i.e. the equivalent of 2 disks. While the rebuild process may occur more often, it will be faster.
Storage node: Multi-device feature
Spreading files among devices
• It is up to the storage:
  - to distribute the projection files among its devices, trying to equalize the free space on each device;
  - to remember where the projection files are located.
• RozoFS is able to store files of a size up to 8 TiB: 8 TiB of data means 4 TiB of projections in layout 0, 2 TiB in layout 1 and 1 TiB in layout 2. Since a device is limited in size, a projection has to be spread among the different devices of a storage.
• Each file is sub-divided in 64 GiB chunks (of user data) and can have a maximum of 128 chunks.
Storage node: Multi-device feature
Spreading files among devices
• On-the-fly chunk allocation: the chunks of a file that have not yet been written have no device allocated. Each chunk of a file is allocated a device to reside on by the storage at the time it is written for the first time. The whole chunk will then be written on this device.
• The size of a chunk of user data is 64 GiB. The size of a projection chunk depends on the layout.
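The chunk arithmetic above can be summarized in a short sketch. The per-layout projection-chunk sizes are inferred from the 8 TiB -> 4/2/1 TiB figures on the previous slide and should be treated as an assumption:

```python
GIB = 1024 ** 3
CHUNK_SIZE = 64 * GIB      # user-data chunk size
MAX_CHUNKS = 128           # maximum chunks per file

# Maximum file size: 128 chunks x 64 GiB = 8 TiB
MAX_FILE_SIZE = MAX_CHUNKS * CHUNK_SIZE

def chunk_index(offset):
    """Chunk a file offset falls into; the chunk's device is
    allocated on first write."""
    return offset // CHUNK_SIZE

def projection_chunk_size(layout):
    """Assumed per-projection chunk size: halves at each layout step,
    matching 8 TiB of data -> 4/2/1 TiB of projections for layouts
    0/1/2 (i.e. 32/16/8 GiB per 64 GiB chunk)."""
    return CHUNK_SIZE >> (layout + 1)
```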
Storage node: Multi-device feature
Projection file structure
• The storage needs to remember the devices assigned to each chunk.
• The projection file formerly (in release 1) had an 8 KiB header followed by the projected data blocks. It is now split into two types of files:
  - a header (or mapper) file that contains the former header complemented with the list of the 128 devices allocated (or not yet allocated) for the chunks;
  - chunk files containing the projections of up to 64 GiB of user data each.
• The location of the header file is given by a hash of the FID modulo the number of devices on which header files can be found. The number of devices a storio handles can be increased, while the hash function on a FID must always give the same location. For this reason, the number of devices holding header files is determined at storage installation time and can never be changed. Devices added later do not hold header files.
• As all the data on a device can be lost when one of its disks fails, it is mandatory to replicate the header files on several devices. There are therefore 3 new configuration parameters per cid/sid:
  - device-mapper: the number of devices hosting header files;
  - device-redundancy: the number of replicas of each header file;
  - device-total: the number of devices holding chunks of projection files.
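The header-file placement above can be sketched as follows. The primary device (hash of the FID modulo device-mapper) comes from the slide; the stand-in hash function and the placement of the replicas on the next devices are illustrative assumptions, not the exact RozoFS scheme:

```python
import hashlib

def header_devices(fid, device_mapper, device_redundancy):
    """Devices assumed to hold the header (mapper) file of a FID.

    Primary device = hash(FID) % device-mapper, as stated on the slide.
    The hash used here (MD5 prefix) and the replica placement on the
    following devices are assumptions for illustration."""
    h = int.from_bytes(hashlib.md5(fid).digest()[:4], "big")
    primary = h % device_mapper
    return [(primary + i) % device_mapper for i in range(device_redundancy)]
```

Note how only the first device-mapper devices can ever be returned, which is why that parameter can never change after installation.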
Storage node: Multi-device feature
Projection file structure

• Example of an extract of a storage configuration file:
• The device-mapper must not be changed after the first storage installation.
• The device-total is increased when adding new devices.
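The configuration extract itself was shown as an image on the slide; a hypothetical excerpt illustrating the three parameters might look like the following (parameter names from the previous slide, all values illustrative):

```
storages = (
  {
    cid = 1;
    sid = 1;
    root = "/srv/rozofs/storage_1_1";
    device-total      = 6;   # devices holding projection chunks
    device-mapper     = 4;   # devices eligible for header files (fixed at installation)
    device-redundancy = 2;   # replicas of each header file
  }
);
```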
Storage node: Multi-device feature
Projection files location (path)
• A file that in release 1 would have been located under: [release-1 path shown on the slide]
• With multi-device: [multi-device paths shown on the slide]
• 2 header files are located on devices 3 and 0.
• 2 chunks are written: the first on device 5 and the second on device 2. Note that the layout and distribution no longer appear in the file path.
• The file path is built the following way:
  <root_path>/<device id>/<type>_<spare>/<slice>/<FID>
  - <type> is 'bins' for chunks of projection files and 'hdr' for header files;
  - <spare> is either 0 or 1;
  - <slice> is a hash computed from the FID to spread all the files among several directories.
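The path construction above can be sketched in a few lines. The slice hash function and the slice count used here are illustrative stand-ins, not the exact RozoFS internals:

```python
import hashlib

def projection_path(root_path, device_id, fid, is_header, spare=0, slice_count=256):
    """Build <root_path>/<device id>/<type>_<spare>/<slice>/<FID>.

    <type> is 'bins' for chunk files and 'hdr' for header files;
    <spare> is 0 or 1; <slice> spreads files among directories.
    The hash and slice_count here are assumptions for illustration."""
    ftype = "hdr" if is_header else "bins"
    slice_nb = int(hashlib.md5(fid.encode()).hexdigest(), 16) % slice_count
    return "%s/%d/%s_%d/%d/%s" % (root_path, device_id, ftype, spare, slice_nb, fid)
```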
Storage node: Multi-device feature
Fault detection
• In the storio, each time an abnormal error is encountered while accessing data on a device, an error counter is incremented. This counter should reveal a problem on some sector of the device.
• A periodic task checks that every device is still accessible in read and write. When it is not, a failure counter is incremented.
• The error and failure counters can be read through the rozodiag interface with the "device" command.
Storage node: Multi-device feature
Fault detection: disk fault example 1
• This display shows the available blocks on each device as well as the error and failure counters.
• In this example, device 1 has encountered some errors and is no longer accessible in read and write.
• This device may need to be rebuilt.
Note: the Nagios plug-in nagios_rozofs_storaged.sh checks these error/failure counters. When an error is raised, the plug-in returns a critical status and shows the list of faulty devices that may require a rebuild.
Storage node: Multi-device feature
Fault detection: disk fault example 2

• This next display shows errors on another storage. At the end of the display, the "Faulty FIDs:" paragraph lists the FIDs that have encountered a problem.
• The displayed FID has encountered a fault that prevents its writing or reading.
• Since only one line is displayed, and not 10 or more, one can guess that the device is not completely failed, but that some disk sector used by the displayed FID may be corrupted.
• In this case, rebuilding only this FID could solve the problem.
Note: the output format is "-s <cid/sid> -f <FID>", which is the input format of the rebuild command described later.
Storaged configuration file
Storaged configuration file
• Threads: number of disk threads associated with the storio process. The default number is 4; the maximum is 16.
• Nbcores: maximum number of core files kept for a storio process. By default 2 core files are kept.
• Storio mode:
  - Single mode (« single »): one storio process for all cid/sid. It listens on all the endpoints defined in the listen section of the configuration file.
  - Multiple mode (« multiple »): the master storaged starts one storio per cid (cluster) defined in the configuration file. As in single mode, each storio listens on all the addresses of the listen section; however, a rule is applied to the port number:
    listening_port_number = config_file_port_number + <cid>
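The port rule above can be expressed as a one-liner (the base port value used in the test is illustrative, not a RozoFS default):

```python
def storio_listening_port(config_file_port, cid, mode="multiple"):
    """Port a storio listens on: in "multiple" mode the configured port
    is shifted by the cluster id; in "single" mode it is used as-is."""
    return config_file_port + cid if mode == "multiple" else config_file_port
```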
Storaged configuration file
• Crc32c configuration:
  - Optionally, each transformed block can be protected with a CRC32C. The goal is to detect and repair blocks for which there is a CRC32C error, whatever the source of the error (hardware or software).
  - It is recommended to enable the CRC32C control, in particular when using non-enterprise drives. The CRC32C generation and control take place in the storio process, and the self-healing is controlled by the storcli process.
  - Upon a read failure due to a CRC32C error, the block in error is regenerated once the initial block has been fully rebuilt. The repair takes place on the storcli.
  - crc32c_check: set to « True » when the CRC must be checked on each block.
  - crc32c_generate: set to « True » to generate a CRC32C on each block written to disk.
  - crc32c_hw_forced: set to « True » to force the usage of the hardware-assisted CRC32C code. That option MUST be used only in the case of a virtual machine for which the hardware features reported by the CPU are incomplete; it is typically the case with VirtualBox. When a CPU does not provide hardware CRC support, turning on the CRC generation and check will hurt the overall performance of the system.
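For reference, CRC32C uses the Castagnoli polynomial (0x1EDC6F41), not the polynomial of the usual zlib CRC32. A slow bitwise software sketch follows; the real storio would use table-driven or hardware-assisted (SSE4.2) code:

```python
def crc32c(data, crc=0):
    """Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
    Reference implementation for illustration only: O(8n), not for
    production use on large blocks."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0x82F63B78
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF
```

The standard check value crc32c(b"123456789") == 0xE3069283 can be used to validate any faster implementation against this one.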
Storaged configuration file
Rozofsmount/Storcli
Storcli architecture overview
[Diagram: Rozofsmount talks to the storcli over an AF-UNIX socket and a shared memory. Inside the storcli, a read/write/truncate request dispatcher feeds a thread selector (enable flag, block-size threshold) in front of the Mojette forward (write) and Mojette inverse (read) thread pools; a storage-nodes load balancer drives the TCP connections to the storage nodes.]

• Storcli receives requests on an AF-UNIX socket from the north-bound interface.
• The data payload is read/written within a shared memory.
• The Mojette transform pass-through mode depends on:
  - the size of the block to transform;
  - the state of the threads (read or write).
• A request is processed by the dispatcher, which takes care of:
  - the dependencies between requests;
  - the communication with the storage nodes;
  - the Mojette transform activation.
• The north interface handles the load balancing group associated with each storage node.
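The thread-selector decision can be sketched as below. The per-direction enable flag and the block-size threshold come from the description above; 65536 bytes is the default threshold reported by rozodiag, and the assumption here is that blocks at or above the threshold are offloaded to a Mojette thread while smaller ones stay in pass-through (inline) mode:

```python
def use_mojette_thread(block_size, thread_enabled, threshold=65536):
    """Offload the Mojette transform to a thread only when the
    per-direction enable flag is set and the block is large enough;
    otherwise process it inline (pass-through mode)."""
    return thread_enabled and block_size >= threshold
```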
Rozofsmount/storcli start-up sequence
[Diagram: the rozofsmount process is started with:
rozofsmount –H <exportd_host_list> -E <export_path> <local_path> <mount_options>
For each of its 1 or 2 storclis, it spawns a rozolauncher:
rozolauncher start /var/run/launcher_rozofsmount_<rozofsmount_id>_storcli<storcli_id>.pid storcli –H <exportd_host_list> -i <storcli_id> -c <storcli_options>
rozolauncher supervises the storcli process and handles its exit().]
Upon a fatal error, a storcli is automatically restarted
Storcli start configuration
• A storcli process provides information related to its start-up configuration. This information is accessible thanks to rozodiag:
• host: hostname of the export node. More than one address might be provisioned; it is typically the case when RozoFS is deployed in a routed environment.
• Module index: reference of the storcli within the rozofsmount that owns it.
• Site: site number. Relevant for the case of geo-replication only.
• Nb_cores: number of core files that can be generated by the storcli process.
Storage node and cluster states seen by Storcli
_________________________________________________________
[127.0.0.1:50004] rzdbg> cid_state
____[storcli 1 of rozofsmount 0]__[ cid_state]____
  cid |    state     |
------+--------------+
    1 |      UP      |
_________________________________________________________
[127.0.0.1:50004] rzdbg> storaged_status
____[storcli 1 of rozofsmount 0]__[ storaged_status]____
 cid  | sid  | hostname             | lbg_id   | state  | Path state | Sel | tmo   | Poll. |Per.| poll state   |
------+------+----------------------+----------+--------+------------+-----+-------+-------+----+--------------+
 001  |  01  | localhost1           |    0     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  02  | localhost2           |    1     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  03  | localhost3           |    2     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  04  | localhost4           |    3     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  05  | localhost5           |    4     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  06  | localhost6           |    5     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  07  | localhost7           |    6     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
 001  |  08  | localhost8           |    7     |  UP    |     UP     | YES |   0   |   0   | 50 |     IDLE     |
_________________________________________________________
[127.0.0.1:50004] rzdbg>
• Storcli provides status related to cid/sid connectivity.
• A cid (cluster) is considered to be up if there is at least one sid which is reachable from the storcli.
• A cid/sid is in the UP state under the following conditions:
  - at least one TCP connection of its load balancing group is UP;
  - the remote end has replied to the NULL-poll requests.
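The state rules above reduce to two small predicates (a minimal sketch; the real storcli tracks per-connection state machines):

```python
def sid_state(tcp_connections_up, null_poll_replied):
    """A cid/sid is UP when at least one TCP connection of its load
    balancing group is up AND the storio answered the NULL-poll."""
    return "UP" if tcp_connections_up >= 1 and null_poll_replied else "DOWN"

def cid_state(sid_states):
    """A cluster is UP as soon as at least one of its sids is reachable."""
    return "UP" if any(s == "UP" for s in sid_states) else "DOWN"
```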
Mojette threads statistics and on-line configuration
• Rozodiag provides the administrator with the capability to change the Mojette thread settings.
• Mojette thread configuration can be changed on the fly. This concerns the enable flag and the buffer-size threshold for entering the threads.
_________________________________________________________[127.0.0.1:50004] rzdbg> MojetteThreads ?
____[storcli 1 of rozofsmount 0]__[ MojetteThreads ?]____
usage:
MojetteThreads reset : reset statistics
MojetteThreads <read|write> enable : enable Mojette threads
MojetteThreads <read|write> disable : disable Mojette threads
MojetteThreads : display statistics
MojetteThreads size <count> : adjust the bytes threshold for thread activation (unit byte)
_________________________________________________________
• Mojette threads menu
Mojette threads statistics and on-line configuration
_________________________________________________________
[127.0.0.1:50004] rzdbg> MojetteThreads
____[storcli 1 of rozofsmount 0]__[ MojetteThreads]____
Thread activation threshold: 65536 bytes
max pending Mojette req cnt: 4
receive empty counter : 0
read/write thread status : DISABLE/ENABLE
Thread number | 0 | 1 | 2 | 3 | TOTAL |
Read Requests |__________________|__________________|__________________|__________________|__________________|
number | 0 | 0 | 0 | 0 | 0 |
Bytes | 0 | 0 | 0 | 0 | 0 |
Cumulative Time (us) | 0 | 0 | 0 | 0 | 0 |
Average Bytes | 0 | 0 | 0 | 0 | 0 |
Average Time (us) | 0 | 0 | 0 | 0 | 0 |
Average Cycle | 0 | 0 | 0 | 0 | 0 |
Throughput (MBytes/s) | 0 | 0 | 0 | 0 | 0 |
Write Requests |__________________|__________________|__________________|__________________|__________________|
number | 100 | 100 | 100 | 100 | 400 |
Bytes | 26214400 | 26214400 | 26214400 | 26214400 | 104857600 |
Cumulative Time (us) | 5758 | 4095 | 4414 | 4274 | 18541 |
Average Bytes | 262144 | 262144 | 262144 | 262144 | 262144 |
Average Time (us) | 57 | 40 | 44 | 42 | 46 |
Average Cycle | 126072 | 89207 | 98851 | 93165 | 101824 |
Throughput (MBytes/s) | 4552 | 6401 | 5938 | 6133 | 5655 |
|__________________|__________________|__________________|__________________|__________________|
_________________________________________________________
[127.0.0.1:50004] rzdbg>
Rozofsmount shared memory
• The rozofsmount uses an AF_UNIX socket and a shared memory segment to communicate with its associated storcli processes. It can support up to 2 storclis.
• The key of the shared memory segment is built as follows: 0x524f5a<rozofsmount_instance><storcli_instance>
• The following illustrates the case where there are 2 rozofsmounts, each of them owning 2 storclis.
root@debian:/home/rozofs/off_mgeo/src/exportd# ipcs -m
------ Segment de mémoire partagée --------
clé        shmid      propriétaire perms  octets    nattch états
0x00000000 0          root         644    80        2
0x00000000 32769      root         644    16384     2
0x00000000 65538      root         644    280       2
0x00000000 98307      didier       600    33554432  2
0x4558504f 294916     root         666    1216      9
0x524f5a30 163845     root         666    8421376   2
0x524f5a31 196614     root         666    8421376   2
0x524f5a32 229383     root         666    8421376   2
0x524f5a33 262152     root         666    8421376   2
.
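The key construction can be cross-checked against the ipcs dump above. A minimal sketch follows; the encoding of the instance part is an assumption inferred from the observed keys 0x524f5a30..0x524f5a33 (0x524f5a is ASCII "ROZ", and the trailing byte looks like the ASCII digit of a combined instance number), not something confirmed by the RozoFS source:

```python
# Hypothetical sketch, inferred from the ipcs dump (not RozoFS source code):
# key = "ROZ" (0x524f5a) followed by one byte that appears to be the ASCII
# digit of the combined instance number (2 storclis per rozofsmount assumed).
def shm_key(rozofsmount_instance: int, storcli_instance: int) -> int:
    combined = 2 * rozofsmount_instance + storcli_instance
    return (0x524F5A << 8) | (0x30 + combined)   # "ROZ" + ASCII digit

# 2 rozofsmounts x 2 storclis -> the four keys seen in the dump above
keys = [shm_key(m, s) for m in (0, 1) for s in (0, 1)]
```

Under this reading, shm_key(0, 0) is 0x524f5a30, i.e. decimal 1380932144, which also matches the key column of the shared_mem rozodiag output below.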
Note: if for any reason two rozofsmounts with the same instance id are started, all I/O operations will fail since the two rozofsmounts use the same shared memory.
_________________________________________________________
[127.0.0.1:50004] rzdbg> shared_mem
____[storcli 1 of rozofsmount 0]__[ shared_mem]____
active | key | size | cnt | address |
--------+-----------+---------+------+----------------+
YES | 1380932144 | 263168 | 0032 | 0x7f320c2dc000 |
YES | 1380932145 | 263168 | 0032 | 0x7f320bad4000 |
_________________________________________________________
Impact of the export configuration on the storcli process
The storcli process communicates with the storage nodes. For that purpose it needs to know which storage nodes are used by the volume associated with the file system referenced by rozofsmount. To that end, storcli gets the storage configuration from the exportd:
  Once it has the list of storage nodes, it establishes TCP connections towards these nodes
  It periodically polls the exportd to detect a change in the exportd configuration
Changing the export configuration, such as adding or removing storages, does not require a restart of the storcli process. From the storcli standpoint the procedure is the following:
  Upon an exportd configuration poll, the exportd reports that the storcli configuration is out of date
  The storcli gets the new configuration
  It establishes connections towards any new storage found in the configuration
  It removes the connections that are no longer referenced in the configuration
The process does not stop the I/O operations that are in progress.
Storcli I/O contexts
• The storcli process uses storcli buffers to process the requests submitted by rozofsmount.
• A storcli process can handle up to 32 transactions in parallel.
• By default storcli attempts to process the transactions in parallel:
  For each transaction submitted by rozofsmount, the storcli process checks whether it overlaps an already ongoing transaction
  When there is an overlap, the current transaction is inserted in a ring
  Each time a transaction ends, the storcli checks among the pending requests of the ring whether the overlap condition has disappeared
  When the overlap condition disappears, the waiting transaction(s) are processed
  The storcli buffer statistics report the number of transactions for which a collision occurred
• However it is possible to force the storcli to operate in a serialized mode.
• Any error during the processing of a transaction is logged internally by incrementing the appropriate error counter.
• When a storcli is in the idle state, the number of allocated transaction contexts MUST be 0.
• The information related to the state of the storcli buffers is accessible through rozodiag.
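The overlap/ring mechanism described above can be sketched as follows. This is illustrative code, not RozoFS source; byte-range overlap on the same FID is assumed as the collision criterion:

```python
# Illustrative sketch of storcli transaction scheduling: requests whose byte
# ranges overlap an in-flight transaction on the same FID are parked in a
# ring and resubmitted when the overlap condition disappears.
from collections import deque

class TransactionScheduler:
    def __init__(self):
        self.active = []      # (fid, start, end) of in-flight transactions
        self.ring = deque()   # parked requests waiting for the overlap to clear

    def overlaps(self, fid, start, end):
        return any(f == fid and start < e and s < end
                   for f, s, e in self.active)

    def submit(self, fid, start, end):
        if self.overlaps(fid, start, end):
            self.ring.append((fid, start, end))   # collision: park it
            return False
        self.active.append((fid, start, end))     # run it in parallel
        return True

    def complete(self, fid, start, end):
        self.active.remove((fid, start, end))
        # retry parked requests whose overlap condition has disappeared
        for req in list(self.ring):
            if not self.overlaps(*req):
                self.ring.remove(req)
                self.active.append(req)

s = TransactionScheduler()
s.submit("fid-1", 0, 4096)        # accepted, runs immediately
s.submit("fid-1", 2048, 8192)     # overlaps the first one: parked in the ring
s.complete("fid-1", 0, 4096)      # parked request is now resubmitted
```

Forcing serialized mode (storcli_buf serialize) amounts to treating every request on the same FID as overlapping.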
[127.0.0.1:50004] rzdbg> storcli_buf ?
____[storcli 1 of rozofsmount 0]__[ storcli_buf ?]____
usage:
storcli_buf : display statistics
storcli_buf serialize : serialize the requests for the same FID
storcli_buf parallel : process in parallel the requests for the same FID
_________________________________________________________
[127.0.0.1:50004] rzdbg>
Storcli context statistics
[127.0.0.1:50004] rzdbg> storcli_buf
____[storcli 1 of rozofsmount 0]__[ storcli_buf]____
number of transaction contexts (initial/allocated) : 64/0
Statistics
serialize mode : NORMAL
req submit/coll: 400/0
FID in parallel: 229
buf. depletion : 0
ring full : 0
SEND : 0
SEND_ERR : 0
RECV_OK : 0
RECV_OUT_SEQ : 0
RTIMEOUT : 0
EMPTY READ : 0
EMPTY WRITE : 25600
Buffer Pool (name[size] : initial/current)
North interface Buffers
small[ 1024] : 64/64
large[264192] : 64/64
South interface Buffers
small[ 1] : 1/1
large[163840] : 1024/1024
_________________________________________________________
[127.0.0.1:50004] rzdbg>
Metadata Server (exportd)
Exportd highlights
• The exportd handles the metadata operations of several exported file systems
• For better performance the exportd supports up to 8 processes
• Each process is responsible for a subset of the exported file systems
• The exportd is responsible for the allocation of the storage servers at creation time
• The exportd controls the removal of the projection files upon file deletion
• The exportd provides user and group disk quota accounting and enforcement
• The exportd operates in active/stand-by mode thanks to pacemaker/DRBD
Metadata Server high-level architecture: Exportd server

(figure: the exportd server runs one Master Exportd (Export M) and slave processes Export S 1..8; the rozofsmount on the client node uses the mount and metadata interfaces; the storaged on the storage nodes is reached through the monitor and remove interfaces; the metadata disks hold the dentry files and inode files)

One exportd server can handle the metadata of more than one file system. One exportd server owns one Master Exportd that controls up to 8 Slave Exportd processes. A RozoFS configuration can include more than one exportd server.
Metadata Server interfaces
• Mount interface: this interface provides the following services for the rozofsmount client:
  filesystem mount: provide the rozofsmount clients with the information related to the file system associated with the rozofsmount (cluster id, storage node IP information, list of the Export Slice endpoints (IP address and port), etc.)
• Metadata interface: this interface is used for all the metadata operations related to a file system: file/directory creation, file/directory lookup, get/set file/directory attributes, etc.
• Monitor interface: this interface is used by the Master Export to collect statistics from the storage nodes that are in the scope of its configuration.
• Remove interface: this interface is used by the Slice Exports to remove the projections on the storage nodes upon file deletion.
• The metadata of a file system is organized in slices. The slice notion is internal to the RozoFS metadata server. The goal of the slices is to distribute the processing of the metadata operations among several Slice Export processes in order to increase the throughput of the metadata server.
• The metadata server is organized around 256 slices. It would be possible to associate one Slice Export process per slice; the current configuration supports up to 16 Slice Export processes.
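The slice distribution can be sketched as follows. Both the hash function and the slice-to-process mapping below are placeholders, since the source does not specify them:

```python
# Illustrative sketch of metadata slicing: a FID is hashed to one of the 256
# slices, and each Slice Export process serves a share of the slices.
# Neither function reflects the actual RozoFS hash or mapping.
NB_SLICES = 256

def slice_of(fid: bytes) -> int:
    return sum(fid) % NB_SLICES          # placeholder hash, not RozoFS's

def owner_process(slice_idx: int, nb_processes: int) -> int:
    return slice_idx % nb_processes      # assumed round-robin mapping

# FID taken from the disk-dump example later in the document
fid = bytes.fromhex("3706f46822be4b000000000000000018")
sl = slice_of(fid)
proc = owner_process(sl, 16)             # with the max of 16 Slice Exports
```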
Slave exportd process architecture
Slave exportd process Metadata server main thread
• main thread
• The main thread is the entry point of the Slice Export process. It has the following characteristics:
  • It supports up to 512 TCP connections on the metadata interfaces; all the messages are processed by the "metadata srv" building block.
  • It dispatches the requests towards:
    the "metadata srv" module: that module processes all the operations related to files/directories: creation/deletion, update, etc.
    the quota module: that module is interfaced by clients running on the same host as the exportd and is responsible for all the operations related to the quota configuration (user and group).
Slave exportd process: Volume balance thread
• volume balance thread
  The role of the volume balance thread is to gather statistics from the storage nodes in order to establish the list of storage nodes on which files can be allocated (file distribution).
  This is a periodic task that exists on each Export Slave process.
  The TCP connections opened towards the storage nodes are ephemeral: as soon as the thread gets the information from a storage node, it closes its TCP connection.
Slave exportd process Dirent write-back thread
• Dirent write-back thread
• The role of the thread is to push pending updates related to modifications of the dentry files (dirent files).
• It is a periodic task whose period can be adjusted with the rozodiag tool. By default the period is 1 second.
• The write-back cache can be disabled; in that case all the modifications of the dirent files are done synchronously, instead of asynchronously as when the thread is enabled.
(figure: the MetaData SRV inserts dirent_header and chunk updates into the dirent write-back cache through the cache handler; the dirent write-back thread flushes the cache to disk at each period and can be enabled/disabled; when the cache is disabled, disk writes are performed directly)
Slave exportd process Dirent write-back thread: thread menu and statistics
• Dirent write-back thread statistics
• Dirent write-back thread menu
Slave exportd process Dirent write-back thread: dirent cache statistics
[127.0.0.1:52001] rzdbg> dirent_cache
____[exportd-S1 ]__[ dirent_cache]____
Malloc size (MB/B)             : 0/8720
Level 0 cache state            : Enabled
Number of entries level 0      : 2
hit/miss                       : 9/2
collisions cumul level0/level1 : 0/0
LRU stats global cpt (ok/err)  : 0/0
coll cpt (ok/err)              : 0/0
collisions Max level0/level1   : 0/0
Name chunk size                : 64
Name chunk max                 : 9
Sectors (nb sectors/size)      : 521/266752 Bytes
------------------+----------------------+--------------+
 field name       | start sector(offset) | sector count |
------------------+----------------------+--------------+
 header           | 0 (0x0   )           | 1            |
 name bitmap      | 1 (0x200 )           | 1            |
 hash buckets     | 2 (0x400 )           | 1            |
 hash entries     | 3 (0x600 )           | 6            |
 name chunks      | 9 (0x1200)           | 512          |
------------------+----------------------+--------------+
------------+-------------+---------------------+
 file_limit | mask        | put count           |
------------+-------------+---------------------+
 10000      | 1           | 2                   |
 100000     | f           | 0                   |
 0          | fff         | 0                   |
------------+-------------+---------------------+
File System usage statistics:
Total Read : memory 0 MBytes (0 Bytes) requests 0
Total Write: memory 0 MBytes (0 Bytes) requests 0
WriteBack cache statistics:
state          : Enabled
NB entries     : 4096
hit/miss/flush : 0/0/0
invalidate     : 0
total Write    : memory 0 MBytes (13312 Bytes) requests 4 ejected chunks 0
Slave exportd process Remove bins thread
• RozoFS file deletion overview
• Upon receiving a file deletion from a client (rozofsmount), the export slave process cleans up the entry within the parent directory (dirent file) but defers the deletion of the associated projections on the storage nodes.
• When a file is deleted, an entry is created in a dedicated directory (dir_trash) whose management is the same as the one used for the i-nodes.
• The files found in that directory contain the FID of the deleted file and the list of the storage nodes from which its projections have to be removed.
• An entry is also created in memory for performance purposes.
• A periodic thread walks the in-memory list of the files to delete and addresses each storage node referenced with the FID in order to release the projection files from the storage nodes.
• A file is fully deleted once all its projections have been removed. In that case, its reference is removed from the associated file within the trash directory of its export.
• The period and the maximum number of files deleted per period are configurable. Defaults: period 5 seconds, max. files 500.
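One period of the remove-bins thread can be sketched as follows. This is a minimal illustration assuming an in-memory list of (FID, storage-node list) pairs; release_projection stands in for the real storage-node RPC:

```python
removed = []

def release_projection(sid, fid):
    """Stand-in for the storage-node RPC that deletes projection files."""
    removed.append((sid, fid))

def trash_tick(pending, limit=500):
    """One thread period: handle at most `limit` deferred file deletions."""
    done = []
    for _ in range(min(limit, len(pending))):
        fid, sids = pending.pop(0)
        for sid in sids:
            release_projection(sid, fid)   # ask each storage node to release
        done.append(fid)
    return done

# 1200 files queued for deletion, 4 storage nodes each
pending = [("fid-%d" % i, [1, 2, 3, 4]) for i in range(1200)]
first = trash_tick(pending)   # 500 files handled this period, 700 left
```

A real implementation would also retry FIDs whose storage nodes did not answer, which this sketch omits.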
Slave exportd process Remove bins thread: statistics
• This information is accessible through rozodiag. • Any change made to the thread configuration will not survive a process restart.
• Remove bins thread menu:
[127.0.0.1:52001] rzdbg> trash ?
____[exportd-S1 ]__[ trash ?]____
usage:
trash limit [nb] : number of file deletions per period (default:500)
trash : display statistics
• Remove bins thread statistics:
[127.0.0.1:52001] rzdbg> trash
____[exportd-S1 ]__[ trash]____
Trash thread period : 2 seconds
file deletion per period : 500
delete stats (pending/done) : 0/0
Slave exportd process I-node tracking periodic thread
• The role of that thread is to periodically check the tracking files (files that contain the i-nodes) to figure out whether files can be truncated and/or removed in order to free space on the disks that hold the exportd files.
• That thread is common to all the "eid"s handled by the slave exportd.
• The period of the thread can be adjusted with rozodiag. By default the period is 30 seconds.
• That thread can be deactivated, but this is not recommended since the i-node tracking files would then never be removed nor truncated after i-node deletion.
Slave exportd process Inode tracking periodic thread: statistics
[127.0.0.1:52001] rzdbg> inode_trck
____[exportd-S1 ]__[ inode_trck]____
period : 30 second(s)
statistics :
- buffer size :16384
- number of buffers :16
- update requests :5
- update errors :0
- flush errors :0
- activation counter:2662
- average time (us) :1
- total time (us) :4385
- total Write : memory 0 MBytes (65536 Bytes) requests 3
[127.0.0.1:52001] rzdbg> inode_trck ?
____[exportd-S1 ]__[ inode_trck ?]____
usage:
expt_thread reset : reset statistics
expt_thread disable : disable export tracking thread
expt_thread enable : enable export tracking thread
expt_thread period [ <period> ] : change thread period(unit is second)
• Inode tracking thread menu:
• Inode tracking thread statistics:
Slave exportd process Fstats thread
• For each export (eid) there is a global file that contains statistics about the file system.
• Each time a file is created/removed these statistics have to be updated. To avoid overloading the system in terms of disk accesses, RozoFS implements a write-behind mechanism.
• The role of the thread is to periodically update on disk the statistics of the exports handled by a slave exportd process.
• The updated statistics are: the number of files and the number of i-nodes.
• The corresponding file is located at the root of an export and is named fstat_<slave_id>, where slave_id is the index of the exportd slave process that is responsible for the eid associated with the export.
• The period of the thread can be changed on the fly with rozodiag. Any change takes place in memory only and will not survive a restart of the exportd process. The default period is 5 seconds.
Slave exportd process fstat thread: statistics
[127.0.0.1:52001] rzdbg> fstat_thread ?
____[exportd-S1 ]__[ fstat_thread ?]____
usage:
fstat_thread                     : display statistics
fstat_thread eid <value>         : display eid filesystem statistics
fstat_thread reset               : reset statistics
fstat_thread period [ <period> ] : change thread period (unit is second)
[127.0.0.1:52001] rzdbg> fstat_thread
___[exportd-S1 ]__[ fstat_thread ]___
period : 5 second(s)
 - activation counter:12725
 - average time (us) :2
 - total time (us)   :28337
statistics :
 thread_update_count :2
• Fstat thread menu
• fstat thread statistics
Slave exportd process User/group quota writeback thread
• RozoFS natively supports quotas per user and group. By default only accounting is enabled; quota enforcement can be activated with the rozo_quota commands.
• For performance reasons, a write-behind mechanism associated with a quota cache is implemented. Any modification of a quota (user or group) takes place in memory (cache).
• A periodic thread takes care of the quota updates and flushes the modified information to disk.
• It is possible to disable the quota write-back thread. In that case, the quota updates are synchronous.
• By default the quota write-back thread is enabled.
• The period of the thread can be changed with rozodiag. The default period is 1 second.
• Note: a direct flush can take place when an insertion in the quota cache triggers an LRU eviction.
Slave exportd process User/group quota write back thread: statistics
[127.0.0.1:52001] rzdbg> quota_wb_thread ?
____[exportd-S1 ]__[ quota_wb_thread ?]____
usage:
quota_wbthread reset               : reset statistics
quota_wbthread disable             : disable writeback dirent cache
quota_wbthread enable              : enable writeback dirent cache
quota_wbthread period [ <period> ] : change thread period (unit is second)
[127.0.0.1:52001] rzdbg> quota_wb_thread
____[exportd-S1 ]__[ quota_wb_thread]____
period : 1 second(s)
statistics :
 - wr chunk counter  :6
 - write (hit/miss)  :4/2
 - activation counter:64217
 - average time (us) :36
 - total time (us)   :2352493
total Write : memory 0 MBytes (480 Bytes) requests 6
[127.0.0.1:52001] rzdbg> quota_cache
____[exportd-S1 ]__[ quota_cache]____
lv2 attributes cache : current/max 6/65536
hit 6 / miss 4 / lru_del 0
entry size 96 - current size 576 - maximum size 6291456
• Quota write-back thread menu
• Quota write-back thread statistics
RozoFS Metadata disk layout for one exported file system
(figure: on-disk tree under the file system metadata root path)

file system metadata root path
 ├─ one tree per i-node type (file inode, directory inode, ext. attr. inode,
 │  sym. link inode):
 │    slice dir 1 .. slice dir 256
 │      inode file 1 .. inode file n
 └─ dentry_slices tree:
      slice dir 1 .. slice dir 256
        FID dir (one per directory)
          bitmap file
          dentry file 1 .. dentry file 4096
          collision files
MDirent file overview: Introduction
A dirent file is made of two main sections:
 Management section: contains all the information required to allocate/release chunks in the data section for storing file information, and to perform lookups for the unique FID associated with either a file or a directory.
 Data section: stores the information related to the directories, regular files and hard links handled by RozoFS. The information found in a dirent file is:
  The external name of the directory/file
  The unique FID associated with that directory/file at creation time
A dirent file is always relative to a parent directory.
MDirent file overview: file types
Dirent file types:
 Root dirent file
  The root dirent file is the file used to start the lookup of a name within the data section of a dirent file.
 Collision dirent file
  A collision dirent file is indirectly accessible through a root dirent file. Theoretically a root dirent file may support up to 2048 collision files.
  The presence of a collision file associated with a root dirent file is indicated by a bitmap handled in the root dirent file.
Dirent file naming rules. The dirent file names have the following format:
 Root dirent file: d_<root_idx>, where root_idx is the index of the root file (0..4095)
 Collision dirent file: d_<root_idx>_<coll_idx>, where coll_idx is the local index within the associated dirent root
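The naming rules above can be captured in a small helper (illustrative only, not RozoFS source):

```python
from typing import Optional

def dirent_filename(root_idx: int, coll_idx: Optional[int] = None) -> str:
    """Build a dirent file name: d_<root_idx> or d_<root_idx>_<coll_idx>."""
    assert 0 <= root_idx <= 4095, "root_idx is in the range 0..4095"
    if coll_idx is None:
        return "d_%d" % root_idx                 # root dirent file
    assert 0 <= coll_idx <= 2047, "coll_idx is in the range 0..2047"
    return "d_%d_%d" % (root_idx, coll_idx)      # collision dirent file

dirent_filename(0)        # root file, matching the d_0 seen in the disk dump
dirent_filename(12, 3)    # third collision file of root file 12
```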
MDirent file overview Per parent directory capacity
 Per parent directory (number of entries per dirent file)     |     384 |     620 |
--------------------------------------------------------------+---------+---------+
 Max number of root dirent files                              |    4096 |    4096 |
 Max collision files per root dirent file                     |      64 |      64 |
 Max entries per root/collision file                          |     384 |     620 |
 Max entries (millions)                                       |  100.66 |  162.52 |
 Max entries with the theoretical 2048 collision files
   (millions)                                                 | 102.23 / 3222.79 | 5203.47 |
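The figures above can be reproduced arithmetically. The 102.23-million value appears to count the root file's own entries together with the 64 collision files (i.e. 65 files of 384 entries); that interpretation is an assumption, the rest follows directly from the stated limits:

```python
ROOTS = 4096        # max root dirent files per parent directory

def max_entries(entries_per_file, coll_files, count_root=False):
    """Total entries: roots x (collision files [+ root file]) x entries/file."""
    files = coll_files + (1 if count_root else 0)
    return ROOTS * files * entries_per_file

max_entries(384, 64)            # 100_663_296   -> 100.66 million
max_entries(620, 64)            # 162_529_280   -> 162.52 million
max_entries(384, 2048, True)    # 3_222_798_336 -> 3222.79 million
max_entries(620, 2048, True)    # 5_203_476_480 -> 5203.47 million
max_entries(384, 64, True)      # 102_236_160   -> 102.23 million (assumed reading)
```

Note that the theoretical rows match only when the root file's entries are included (2049 files), which suggests the two "max entries" rows of the table were computed with slightly different conventions.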
MDirent file overview
MDIRENT file layout (disk representation for the 384-entries case):

 block 0      | dirent_header: version, dirent file ref, parent fid, root_idx,
              | coll_entry_bitmap (2048 entries)
 block 1      | hash_bitmap (384 hash entries), name_bitmap (4096 chunks of 32 bytes)
 block 2      | hash_table: 256 buckets
 blocks 3-8   | hash entries #0 .. #5: 64 entries of 8 bytes each
 blocks 9-521 | name entries #0 .. #215: 16 chunks of 64 bytes each
MDirent file overview
DIRENT file layout (memory representation for the 384-entries case)

(figure: the dirent_cache_t root structure holds the dirent_header plus pointer arrays:
 free_hash_bitmap_p -> free_hash_bitmap (48 bytes for 384 entries);
 free_name_bitmap_p -> free_name_bitmap (432 bytes for 384 entries);
 hash_tbl_p[0..3] -> hash_level_32 blocks (64 entries -> 128 bytes each);
 hash_entry_p[0..5] -> hash_entry blocks (64 entries -> 512 bytes each);
 name_entry_lvl0_p[0..7] -> hash_name_lvl0_p (128) -> hash_name_lvl1_p[0..15] -> name entries stored as 32 blocks of 64 bytes;
 dirent_coll_lvl0_p[0..15] -> coll_dirent_p (512), 64 pointers per entry -> dirent_cache_p[0..2047] referencing the collision dirent cache entries)
MDirent file overview: disk dump
• Searching for the slice directories that contain some dentries. The search starts at the root of the exported file system:
root@debian:/home/rozofs/off_mgeo/tests/export_1# ls */*/attributes
198/a091301b-742a-4c00-1400-000000000018/attributes
20/3706f468-22be-4b00-0000-000000000018/attributes
• The directory that follows the slice directory is the ASCII representation of the FID of the user directory. Within a user directory there MUST be a file called attributes and at least one d_<xx> file, where <xx> is the result of a hash applied to the name of the object (either a file, symlink or directory):
root@debian:/home/rozofs/off_mgeo/tests/export_1# cd 20/3706f468-22be-4b00-0000-000000000018/
root@debian:/home/rozofs/off_mgeo/tests/export_1/20/3706f468-22be-4b00-0000-000000000018# ls
attributes  d_0  d_1
• The attributes file is a bitmap that records which dirent files have been created. It has a fixed size of 512 bytes:
root@debian:/home/rozofs/off_mgeo/tests/export_1/20/3706f468-22be-4b00-0000-000000000018# hexdump attributes
0000000 0003 0000 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200
• Here in the example, the bitmap (value 0x0003) indicates that 2 dirent files have been created: d_0 and d_1.
RozoFS i-node (regular file)
• RozoFS i-node:
  • 512-byte structure
  • Includes the regular attributes
  • RozoFS-specific fields:
    FID: unique file identifier (RozoFS i-node number)
    File distribution: cid and list of sids (storage nodes where to find the projections)
    Parent FID: RozoFS i-node number of the parent directory
    Dirent_Name: user filename
(figure: ext_mattr, the 512-byte RozoFS i-node)

ext_mattr (512 bytes):
  mattr_t attrs : fid, cid, sids[ROZOFS_SAFE_MAX], mode, uid, gid, nlink,
                  ctime, atime, mtime, size, children
  pfid, i_extra_isize, i_file_acl, i_link_name, dirent_name,
  extended attributes array

dirent_name encoding:
  DIRECT  : type (1 bit), len (15 bits), hash_suffix (16 bits), coll (1 bit),
            root_idx (15 bits), coll_idx (16 bits), suffix (16 bytes),
            name (60 bytes)
  INDIRECT: chunk_id (12 bits), nb_chunks (4 bits), dirent_file_idx (16 bits)
I-node file management
• I-node file
  An i-node file can contain up to 2044 i-nodes.
  The header of the file contains:
   A timestamp: creation time of the first i-node
   A relative i-node index table
(figure: i-node tracking files within a slice)

track_main_[1..4] : exp_trck_header_t (16 bytes): first_idx = 0, last_idx = 10
trk_0 .. trk_10   : i-node files, each made of:
  exp_trck_file_header_t (4096 bytes): creation_time, inode_idx_table[2044]
  inode_array: inode idx 0 .. inode idx 2043
• Tracking main file: that file keeps track of the first and last i-node files within a slice.
• The i-node file management is common to all the i-node types.
• The difference resides only in the payload and size of the i-node.
I-node access from FID
I-node allocation
(figure: a freshly created i-node file)

exp_trck_file_header_t: creation_time = T1
  inode_idx_table[0] = 0 ... inode_idx_table[2043] = 2043
inode_array: inode idx 0
• RozoFS i-node time and space organization:
  RozoFS i-nodes are always allocated in increasing order of the i-node number (40 bits) within a slice.
  A RozoFS i-node that has been released is never re-allocated.
  A RozoFS i-node file carries the timestamp of the first i-node allocated within it; thus each RozoFS i-node has embedded virtual time information.
  The space allocation (i-node and data blocks) needed by the i-node file is handled by the file system (e.g. ext4) used by RozoFS to store RozoFS i-nodes.
• I-node file creation:
  The i-node index table is initialized with the relative indexes within the header.
  The i-node file header is then saved on disk.
  The i-node content is then appended to the i-node file, and the i-node file is saved on disk.
• Double allocation prevention:
  In order to avoid allocating the same i-node twice, the header is saved on disk as if all the i-nodes of the i-node file were already allocated.
  Upon a restart of the metadata server, the index of the first i-node to allocate within a file is deduced from the real i-node file size.
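The restart rule can be sketched as pure arithmetic — header size from the slide, while the 512-byte i-node size is a hypothetical value for illustration:

```python
HEADER_SIZE = 4096   # exp_trck_file_header_t size, from the slide
INODE_SIZE = 512     # hypothetical fixed i-node payload size

def first_free_index(file_size):
    """Deduce the next i-node index to allocate from the real size of
    the i-node file: the header claims all slots are allocated, so only
    the amount of appended payload tells how far allocation went."""
    assert file_size >= HEADER_SIZE
    return (file_size - HEADER_SIZE) // INODE_SIZE
```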
I-node release
• The i-node release operation consists in clearing the entry of the relative i-node index within its associated i-node file.
• The space taken by a released i-node is reclaimed by resizing the i-node file; this is done by a periodic task.
• When all the indexes within an i-node file header are cleared (-1), the i-node file is deleted.
• The main tracking file is updated if the deleted i-node file matches the index of the first file of the main tracking file.
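A minimal in-memory sketch of the release bookkeeping (the actual file resizing is left to the periodic task):

```python
def release_inode(idx_table, rel_idx):
    """Clear the relative index of a released i-node within its i-node
    file; return True when every slot is cleared (-1), meaning the
    i-node file itself can be deleted."""
    idx_table[rel_idx] = -1
    return all(v == -1 for v in idx_table)
```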
[Figure: release of relative i-node 1, before/after.
  Before: the header holds creation_time = T1 and inode_idx_table[0] = 0, [1] = -1, [2] = 2, …, [2043] = 2043; the inode_array still contains inode idx 0, 1, 2, …, 2043.
  After: the header holds inode_idx_table[0] = 0, [1] = -1, [2] = 1, …, [2043] = 2042; the inode_array contains inode idx 0, 2, …, 2043 and the file has been resized.]
Exportd start-up sequence
[Figure: start-up sequence.
  The Export Master Process is started as:
    exportd -c <config_file> -d <rozodiag_port>
  It spawns 8 rozolauncher instances (x8):
    rozolauncher start /var/run/launcher_exportd_slave_<slave_id>.pid exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
  Each rozolauncher runs an Export Slave Process:
    exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
  On exit() of a slave, its parent rozolauncher restarts it.]
• The Master exportd process is started first.
• It launches 8 Slave exportd processes through rozolauncher.
• Each rozolauncher starts an exportd process in slave mode and provides it with the slave_id reference as well as the export configuration file.
• In case of failure (exit()), the slave exportd process is automatically restarted by its parent rozolauncher; before the restart, a core dump file might be generated.
Exportd shared memory
root@debian:/home/rozofs/off_mgeo/src/exportd# ipcs -m
------ Shared Memory Segments --------
key        shmid    owner   perms  bytes     nattch  status
0x00000000 0        root    644    80        2
0x00000000 32769    root    644    16384     2
0x00000000 65538    root    644    280       2
0x00000000 98307    didier  600    33554432  2       dest
0x4558504f 294916   root    666    1216      9
0x524f5a30 163845   root    666    8421376   2
0x524f5a31 196614   root    666    8421376   2
0x524f5a32 229383   root    666    8421376   2
0x524f5a33 262152   root    666    8421376   2
...
• The Master Exportd process supervises the Slave Exportd processes through a shared memory segment that it creates when it starts. It is used to report the information related to each exportd slave process.
• Notes: the key of the Master Exportd segment is the constant « EXPO ». It is not possible to run two Master exportd processes on the same host.
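The keys visible in the ipcs output are simply the ASCII codes of a 4-character tag: 0x4558504f spells « EXPO » and 0x524f5a30 … 0x524f5a33 spell « ROZ0 » … « ROZ3 ». A sketch of that (assumed) derivation:

```python
def ipc_key(name):
    """Build a 32-bit System V IPC key from a 4-character ASCII tag by
    packing its bytes big-endian (assumed convention, matching the
    values shown by ipcs -m above)."""
    assert len(name) == 4
    return int.from_bytes(name, "big")

assert ipc_key(b"EXPO") == 0x4558504F  # Master Exportd segment
assert ipc_key(b"ROZ0") == 0x524F5A30  # first slave segment
```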
Exportd Slave status
• Information relative to the exportd slave status can be obtained from the Master Exportd:

[127.0.0.1:52000] rzdbg> exp_slave
____[exportd-M]__[exp_slave]____
 id | pid   | state   | uptime          | observer port | metadata port | reload count |
----+-------+---------+-----------------+---------------+---------------+--------------+
  0 | 11544 | running | 1 days, 4:20:53 | 52000         | 53000         | 0 |
  1 | 11637 | running | 1 days, 4:20:50 | 52001         | 53001         | 0 |
  2 | 11639 | running | 1 days, 4:20:50 | 52002         | 53002         | 0 |
  3 | 11640 | running | 1 days, 4:20:50 | 52003         | 53003         | 0 |
  4 | 11644 | running | 1 days, 4:20:50 | 52004         | 53004         | 0 |
  5 | 11646 | running | 1 days, 4:20:50 | 52005         | 53005         | 0 |
  6 | 11654 | running | 1 days, 4:20:50 | 52006         | 53006         | 0 |
  7 | 11650 | running | 1 days, 4:20:50 | 52007         | 53007         | 0 |
  8 | 11653 | running | 1 days, 4:20:50 | 52008         | 53008         | 0 |
• id: slave identifier. The value 0 corresponds to the Master Exportd process.
• pid: the pid of the process.
• state: current state of the process:
  Starting: asserted by the Master exportd process when it starts a slave exportd.
  Running: asserted by the Slave process once it is ready to process metadata requests.
• uptime: time since the process has been up and running.
• observer port: port to be used with rozodiag.
• metadata port: listening port for all metadata operations.
• reload count: number of times the exportd configuration has been reloaded by the process.
Export reload highlights
• The exportd reload takes place upon a change within the export configuration file: adding a storage node in a cluster, adding a volume, etc.
• The reload is done without stopping the exportd processes (Master and Slaves).
• This is achieved by raising the SIGUSR1 Linux signal.
• The operation in progress is suspended until the end of the processing of the configuration file.
• Nominal case:
  The Master process updates its own configuration and then informs its Slave processes that they must process the new configuration.
  Once a new configuration has been successfully loaded, a new md5 is computed for the configuration. It is used as a trigger by clients to detect a configuration change remotely and to ask for the new configuration.
• Failure case:
  In case of failure during the parsing of the new configuration, the system reverts to the current configuration. The Slave processes are not involved in that case.
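The md5 trigger can be sketched as follows — hashlib is used for illustration; exactly which bytes RozoFS hashes is not specified here:

```python
import hashlib

def config_digest(config_text):
    """md5 of the export configuration; a client caches the digest and
    asks for the new configuration whenever the value changes."""
    return hashlib.md5(config_text.encode()).hexdigest()

def client_needs_reload(cached_digest, current_digest):
    """Trigger logic on the client side: reload on digest mismatch."""
    return cached_digest != current_digest
```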
Exportd reload nominal sequence
[Figure: reload sequence.
  kill -1 <exportd_master_pid> is sent to the Export Master Process, which validates the new configuration file.
  The Master then triggers, for each of the 8 slaves:
    rozolauncher reload /var/run/launcher_exportd_slave_<slave_id>.pid exportd -i <slave_id> -s -c <config_file> -d <rozodiag_port>
  Each rozolauncher issues kill -1 <exportd_slave_pid> towards its Export Slave Process.
  The slave parses and loads the new configuration file and increments the reload count at the end of the processing.]
• RozoFS data path
  Mojette Transform performances
  Mojette Transform use cases
Mojette Transform performances
Encoding/decoding performances with 2 redundancy projections (4+2)

[Figure: four bar charts comparing Cauchy-good 4+2, Reed-sol-van 4+2, Reed-sol-r6 4+2 and Mojette 4+2; throughput in MBytes/s.
  Encoding performances (4+2), best case: over user data block sizes from 1024 to 8192 bytes, Mojette peaks at 12371 MBytes/s, well above the Reed-Solomon and Cauchy codes (below 3000 MBytes/s).
  Encoding performances (4+2), worst case: same axes; Mojette peaks at 8247 MBytes/s.
  Decoding performances (4+2), 4K user data block, versus number of failures: Mojette stays around 7331-7452 MBytes/s whatever the failure count.
  Decoding performances (4+2), 8K user data block: Mojette stays around 8018-8173 MBytes/s whatever the failure count.]
1. Mojette decoding/encoding is not CPU intensive and fits well on the client side.
2. Mojette decoding time does not depend on the number of failures.
Mojette Transform performances
Encoding/decoding performances with 4 redundancy projections (8+4)

[Figure: four bar charts comparing Cauchy-good 8+4, Cauchy-orig 8+4, Reed-sol-van 8+4 and Mojette 8+4; throughput in MBytes/s.
  Encoding performances (8+4), best case: over user data block sizes from 1024 to 8192 bytes, Mojette peaks at 4948 MBytes/s.
  Encoding performances (8+4), worst case: Mojette peaks at 3192 MBytes/s.
  Decoding performances (8+4), 4K user data block, versus number of failures (1 to 4): Mojette stays between roughly 1903 and 2749 MBytes/s while the other codes fall well below.
  Decoding performances (8+4), 8K user data block: Mojette stays between roughly 3092 and 3665 MBytes/s.]
RozoFS data-path write service
File system block forward transformation (nominal use case)

[Figure: the user payload is split into User Data Blocks UDB 1 … UDB 5; the Mojette Transform Forward + Write process produces projections proj 1.x, proj 2.x, proj 3.x per UDB, which the RozoFS layout distribution sends to OSD nodes 1, 2, 3, 4 (the optimal distribution), with spare node(s) on standby.]

• The set of OSDs is provided within the metadata associated with the file.
• The user payload is split into User Data Blocks (4K or 8K).
• The Mojette transform is applied on each UDB.
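The payload split can be sketched as plain fixed-size chunking (a 4K UDB size is assumed here; 8K also exists):

```python
UDB_SIZE = 4096  # user data block size (4K; 8K is the other option)

def split_payload(payload):
    """Split the user payload into fixed-size User Data Blocks; the
    Mojette forward transform is then applied to each UDB. The last
    block may be shorter than UDB_SIZE."""
    return [payload[i:i + UDB_SIZE] for i in range(0, len(payload), UDB_SIZE)]
```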
RozoFS data-path write service
Nominal use case sequence diagram

[Figure: nominal write sequence.
  1. application → vfs/fuse: pwrite(fd,buf,offset,length); vfs/fuse → rozofsmount/storcli: pwrite(fd_rozofs,buf,offset,length); storcli runs mojette_transform_forward(buf).
  2. storcli sends write_req(fid,projection1), write_req(id,projection2) and write_req(id,projection3) in parallel to storage nodes 1, 2 and 3 and collects write_rsp(ok) from each.
  3. status is returned to the application.]
• Write transactions are performed in parallel.
• The write service ends upon receiving all the responses from the OSD nodes.
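A minimal sketch of the parallel write fan-out; `send_write_req` is a hypothetical callable standing in for the real RPC client:

```python
from concurrent.futures import ThreadPoolExecutor

def write_projections(send_write_req, projections):
    """Issue the write_req transactions towards the OSD nodes in
    parallel, and declare the write successful only once every node
    has answered write_rsp(ok). send_write_req(node, projection) is a
    hypothetical stand-in returning True on write_rsp(ok)."""
    nodes = list(range(len(projections)))
    with ThreadPoolExecutor(max_workers=len(projections)) as pool:
        replies = list(pool.map(send_write_req, nodes, projections))
    return all(replies)
```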
RozoFS data-path write service
Failure use case

[Figure: same pipeline as the nominal case (UDB 1 … 5, Mojette Transform Forward + Write, projections proj 1.x, proj 2.x, proj 3.x distributed per the RozoFS layout over OSD nodes 1-4), except that the proj 3.x projections are redirected to a spare node.]

• A spare OSD is used in case of failure of an OSD belonging to the optimal distribution.
• A write operation is successful when n+m projections are successfully written.
RozoFS data-path write service
Failure sequence diagram

[Figure: write sequence with an OSD failure.
  1. application → vfs/fuse: pwrite(fd,buf,offset,length); vfs/fuse → rozofsmount/storcli: pwrite(fd_rozofs,buf,offset,length); storcli runs mojette_transform_forward(buf).
  2. storcli sends write_req(fid,projection1), write_req(id,projection2) and write_req(id,projection3) in parallel to storage nodes 1, 2 and 3; nodes 1 and 2 answer write_rsp(ok), node 3 does not.
  3. write_req(id,projection3) is re-issued towards storage node 4 (spare), which answers write_rsp(ok).
  4. status is returned to the application.]
RozoFS data-path read service
File system block Mojette inverse transformation (nominal use case)

[Figure: the Read + Inverse Mojette Transform process reads projections 1, 2 and 3 from the OSD nodes of the optimal distribution (OSD nodes 1-4, given by the RozoFS layout distribution) and rebuilds the UDB (4K or 8K).]

• The read process selects n projections among the n+m projections to rebuild a User Data Block.
• It can be any projection subset of size n within the n+m projection set.
• Read transactions towards the OSDs are performed in parallel, to minimize the data transfer delay over the network.
RozoFS data-path read service
Sequence diagram (nominal use case)

[Figure: nominal read sequence.
  1. application → vfs/fuse: pread(fd,buf,offset,length); vfs/fuse → rozofsmount/storcli: read(fd_rozofs,buf,offset,length).
  2. storcli sends read_req(fid,offset,len) in parallel to storage nodes 1 and 2 and receives read_rsp(projection1) and read_rsp(projection2).
  3. storcli runs mojette_transform_inverse(buf,projection1,projection2) and returns buf,length to the application.]
RozoFS data-path read service
Failure use case

[Figure: same pipeline as the nominal read case (Read + Inverse Mojette Transform over the optimal distribution, OSD nodes 1-4), with an additional read issued towards a remaining OSD.]

• Reading is attempted on the remaining OSDs in case of a projection read failure:
  disk failure
  network failure
  out-of-date projection
RozoFS data-path read service
Failure sequence diagram

[Figure: read sequence with a guard timer.
  1. application → vfs/fuse: pread(fd,buf,offset,length); vfs/fuse → rozofsmount/storcli: read(fd_rozofs,buf,offset,length).
  2. storcli sends read_req(fid,offset,len) in parallel to storage nodes 1 and 2; read_rsp(projection1) arrives from node 1, node 2 does not answer.
  3. timer_expiration: read_req(projection3) is propagated towards storage node 3, which answers read_rsp(projection3).
  4. storcli runs mojette_transform_inverse(buf,projection1,projection3) and returns buf,length to the application.]

• Fast projection recovery time:
  a guard timer is started on the first projection read reply;
  at timer expiration, read requests are propagated towards the remaining OSDs.
RozoFS data-path read service
Failure sequence diagram: case of a CRC 32 error

[Figure: read sequence with CRC error and self-healing.
  1. application → vfs/fuse: pread(fd,buf,offset,length); vfs/fuse → rozofsmount/storcli: read(fd_rozofs,buf,offset,length).
  2. storcli sends read_req(fid,offset,len) in parallel to storage nodes 1 and 2; node 1 answers read_rsp(projection1), node 2 detects a CRC error and answers read_rsp(projection2,crc_error).
  3. storcli sends read_req(projection3) to storage node 3 and receives read_rsp(projection3).
  4. storcli runs mojette_transform_inverse(buf,projection1,projection3) and returns buf,length to the application; the self-healing sequence then starts.
  5. storcli runs mojette_transform_forward_one(buf,projection 2) and sends write_req(projection_2) to storage node 2.
  6. storage node 2 answers write_rsp(OK): the block in error is now fixed.]
• The CRC error is detected on the storage node.
• The storage node informs the client that the read failure is due to a CRC error.
• After rebuilding the initial data, the storcli process triggers a forward transform that concerns only the faulty projection.
• There might be more than one block to regenerate (depending on the number of CRC errors).
• Once the projection has been regenerated, it is sent back to the associated storage node.
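The detection step can be sketched as follows — zlib's CRC-32 is used for illustration; the actual checksum layout on the storage node is not shown in the source:

```python
import zlib

def read_projection(data, stored_crc):
    """Storage-node-side check: return (data, None) when the CRC32
    matches, or (None, "crc_error") so that storcli falls back to
    another projection and later writes back the regenerated one."""
    if zlib.crc32(data) == stored_crc:
        return data, None
    return None, "crc_error"
```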
• RozoFS user and group quotas
RozoFS disk quota
Introduction

• The quota subsystem allows the system administrator to set limits on the used space and the number of used i-nodes (an i-node is a file system structure which is associated with each file or directory) for users and/or groups.
• For both the used space and the number of used i-nodes there are actually two limits: the first one is called the soft limit and the second one the hard limit.
• A user can never exceed a hard limit for any resource.
• When a user exceeds a soft limit, (s)he is warned that (s)he uses more space than (s)he should, but space/i-nodes are still allocated (of course, only if the hard limit is not exceeded either).
• If a user has been exceeding a soft limit for a specified period of time (this period is called the grace time), (s)he is not allowed to allocate more resources, so (s)he must free some space/i-nodes to get back under the soft limit.
• Quota limits are set independently for each file system (eid).
• By default RozoFS quotas are always on (accounting), but quota enforcement can be controlled thanks to rozo_quotaon.
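The soft/hard limit policy above can be sketched as a small decision function (a hypothetical illustration of the policy, not the RozoFS code):

```python
import time

def quota_check(used, request, soft, hard, grace_start, grace_time, now=None):
    """Decide whether allocating `request` more units is permitted.
    Returns (allowed, warning):
      - the hard limit can never be exceeded;
      - over the soft limit the allocation succeeds with a warning,
        unless the grace time has already expired."""
    now = time.time() if now is None else now
    new_used = used + request
    if new_used > hard:
        return False, "hard limit exceeded"
    if new_used > soft:
        if grace_start is not None and now - grace_start > grace_time:
            return False, "soft limit exceeded beyond grace time"
        return True, "soft limit exceeded"
    return True, None
```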
RozoFS disk quota
Introduction

• There are separate quota files for users and groups.
• The quota files are specific to an exported file system.
• The structure of the files that contain the quotas is specific to RozoFS: one can access the contents of the RozoFS quota files thanks to the quota services provided with RozoFS.
• The structure of a quota record within a file is the following:
typedef struct _rozo_mem_dqblk {
int64_t dqb_bhardlimit; /* absolute limit on disk blks alloc */
int64_t dqb_bsoftlimit; /* preferred limit on disk blks */
int64_t dqb_curspace; /* current used space */
int64_t dqb_ihardlimit; /* absolute limit on allocated inodes */
int64_t dqb_isoftlimit; /* preferred inode limit */
int64_t dqb_curinodes; /* current # allocated inodes */
time_t dqb_btime; /* time limit for excessive disk use */
time_t dqb_itime; /* time limit for excessive inode use */
} rozo_mem_dqblk;
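Assuming the record is stored with the C layout above on an LP64 Linux host (time_t taken as 64-bit, little-endian byte order — assumptions, not taken from the RozoFS sources), a quota record can be decoded as:

```python
import struct

# Six int64 fields followed by two time_t fields taken as 64-bit.
DQBLK_FMT = "<6q2q"
DQBLK_SIZE = struct.calcsize(DQBLK_FMT)  # 64 bytes under these assumptions

def unpack_dqblk(raw):
    """Decode one rozo_mem_dqblk-like quota record into named fields."""
    fields = ("dqb_bhardlimit", "dqb_bsoftlimit", "dqb_curspace",
              "dqb_ihardlimit", "dqb_isoftlimit", "dqb_curinodes",
              "dqb_btime", "dqb_itime")
    return dict(zip(fields, struct.unpack(DQBLK_FMT, raw)))
```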
RozoFS quota files
[Figure: quota file tree under the file system metadata root path. The group branch holds quotainfo_grp, the group_btmap bitmap file and the quota files group_0 … group_n; the user branch holds quotainfo_usr, user_btmap and user_0 … user_n. A quota file consists of a header holding a bitmap (8192 bytes) and up to 2048 entry_idx slots (e.g. entry_idx[0] = 0x8000), followed by a payload of quota_info records such as quota_info(usr1) and quota_info(usr2).]
• The bitmap file permits covering 64K quota files.
• Each quota file can contain up to 2048 quota records.
• The uid/gid is used as a relative index within a quota file to find out the effective entry in the payload.
• The structure can cover up to about 133 million users or groups.
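The uid/gid to quota-file mapping described above can be sketched as follows (entry count per file from the slide; the divmod convention itself is an assumption):

```python
ENTRIES_PER_FILE = 2048   # quota records per quota file
MAX_FILES = 64 * 1024     # quota files addressable by the bitmap

def quota_slot(uid):
    """Map a uid/gid to (quota file index, relative entry index); the
    uid is used as a relative index within a quota file."""
    file_idx, entry_idx = divmod(uid, ENTRIES_PER_FILE)
    assert file_idx < MAX_FILES
    return file_idx, entry_idx
```

Note that 64K files of 2048 entries gives 134,217,728 addressable slots, consistent with the "about 133 million" capacity stated above.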
Turning on/off quota on a file system
NAME
    rozo_quotaon, rozo_quotaoff - turn filesystem quotas on and off

SYNOPSIS
    /usr/sbin/rozo_quotaon [ -vugfp ] [ -e exportconf-name ] filesystem-id...
    /usr/sbin/rozo_quotaon [ -avugfp ] [ -e exportconf-name ]
    /usr/sbin/rozo_quotaoff [ -vugp ] filesystem-id...
    /usr/sbin/rozo_quotaoff [ -avugp ]

DESCRIPTION
    rozo_quotaon announces to the system that disk quotas should be enabled on one or more file systems. There are two components to the RozoFS disk quota system: accounting and limit enforcement. RozoFS file systems require that quota accounting be turned on at mount time. It is possible to enable and disable limit enforcement on a RozoFS file system after quota accounting is already turned on. The default is to turn on both accounting and enforcement.

    The RozoFS quota implementation does not maintain quota information in user-visible files, but rather stores this information internally.

    rozo_quotaoff announces to the system that the specified file systems should have any disk quotas turned off.
RozoFS disk quota components

[Figure: RozoFS disk quota services on the Exportd host. rozo_quotaon and get/set quota requests go through the Export Master (Export M), which supervises the Export Slaves (Export S 1 … Export S 8) and the configuration; the metadata disks hold the dentry files and i-node files. rozo_repquota and rozo_setquota get/set quotas through the Export Master, while rozo_warnquota reads the quota files directly, resolves identities through the Name Service Switch (getpwuid()/getgrgid()) and sends mail via sendmail/exim4.]
The Name Service Switch
• Name Service Switch configuration file /etc/nsswitch.conf:
  hosts: files dns
• On the right-hand side of the colon are the data sources, where NSS will go to retrieve the system database. It progresses left to right, checking each source in turn until the data is found.
• On the left-hand side of the colon are the groupings of data, the database itself, which we are calling "maps"; in this example, the passwd database API functions are mapped to the "compat" and "files" data sources.
• When an NSS function is called, the NSS implementation reads its configuration file /etc/nsswitch.conf, which names the library that implements the data retrieval. NSS dynamically loads this library, in this example libnss_files.so.
• The correct function within this library is then called, for example _nss_files_getpwuid().
• libnss_files then opens and parses /etc/passwd, and returns the result (typically a struct).
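The lookup chain above can be sketched as a left-to-right walk over the configured sources; the callables standing in for libnss_files.so / libnss_ldap.so are hypothetical:

```python
def nss_lookup(database, key, sources, nsswitch):
    """Walk the data sources configured for a database left to right,
    as /etc/nsswitch.conf dictates, returning the first hit. Each
    source is a callable returning a record or None (stand-ins for
    the libnss_* modules)."""
    for name in nsswitch[database]:
        record = sources[name](key)
        if record is not None:
            return record
    return None
```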
NSS + RFC 2307 LDAP
• Add in a directory service, and you get a situation familiar to many sysadmins: /etc/nsswitch.conf would now also list ldap in addition to files in this example.
• If NSS were to load libnss_files.so and find nothing, it would then load libnss_ldap.so. libnss_ldap.so would make a network connection to the LDAP server, perform a query, and convert the LDAP results into the right return structure.
• This means that every query translates into a TCP connection with handshake overhead, possibly over SSL with its crypto overhead, followed by various ASN.1 and BER encoding and decoding within the LDAP protocol itself.
• Miscellaneous
RozoFS Core files
• A core file is generated upon a fatal error encountered during the execution of a RozoFS process.
• By default, the system keeps up to 2 core files per process.
• The core files generated for a RozoFS process are found under the /var/run/rozofs_core directory.
• There is one directory per RozoFS process:
  exportd: Master metadata server
  export_slave: Slave metadata server
  geomgr: geo-replication process
  rozofsmount: RozoFS client process
  storaged: Master storaged process
  storio: Slave storaged process
  storcli: RozoFS erasure coding read/write client process
List of the reserved ports for RozoFS usage
[127.0.0.1:50004] rzdbg> reserved_ports
____[storcli 1 of rozofsmount 0]__[reserved_ports]____
 Value | Nb  | Const | /etc/services             | Role
-------+-----+-------+---------------------------+----------------------------------------
 52000 |   9 | 52000 | rozofs_export_diag        | Export master and slave diagnostic
 53000 |   9 | 53000 | rozofs_export_eproto      | Export master and slave eproto
 53010 |   1 | 53010 | rozofs_export_geo_replica | Export master geo-replication
 50003 |  24 | 50003 | rozofs_mount_diag         | rozofsmount & storcli diagnostic
 50027 | 256 | 50027 | rozofs_storaged_diag      | Storaged & storio diagnostic
 51000 |   1 | 51000 | rozofs_storaged_mproto    | Storaged mproto
 54000 |  91 | 54000 | rozofs_geomgr_diag        | Geo-replication manager, clients & storcli diagnostic
echo net.ipv4.ip_local_reserved_ports="52000-52008,53000-53008,53010-53010,50003-50026,50027-50282,51000-51000,54000-54090" >> /etc/sysctl.conf
echo "52000-52008,53000-53008,53010-53010,50003-50026,50027-50282,51000-51000,54000-54090" > /proc/sys/net/ipv4/ip_local_reserved_ports
grep ip_local_reserved_ports /etc/sysctl.conf
cat /proc/sys/net/ipv4/ip_local_reserved_ports