Kunpeng BoostKit for SDS
Tuning Guide
Issue 10
Date 2021-09-13
HUAWEI TECHNOLOGIES CO., LTD.
Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.
No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd. All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.
The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.
Issue 10 (2021-09-13) Copyright © Huawei Technologies Co., Ltd. i
Contents
1 Using the Kunpeng Hyper Tuner for Tuning
2 Ceph Block Storage Tuning Guide
2.1 Introduction
2.1.1 Components
2.1.2 Environment
2.1.3 Tuning Guidelines and Process Flow
2.2 General-Purpose Storage
2.2.1 Hardware Tuning
2.2.2 System Tuning
2.2.3 Ceph Tuning
2.2.4 KAE zlib Compression Tuning
2.3 High-Performance Storage
2.3.1 Hardware Tuning
2.3.2 System Tuning
2.3.3 Ceph Tuning
3 Ceph Object Storage Tuning Guide
3.1 Introduction
3.1.1 Overview
3.1.2 Environment
3.1.3 Tuning Guidelines and Process Flow
3.2 Cold Storage
3.2.1 Hardware Tuning
3.2.2 System Tuning
3.2.3 Ceph Tuning
3.3 General-Purpose Storage
3.3.1 Hardware Tuning
3.3.2 System Tuning
3.3.3 Ceph Tuning
3.3.4 KAE zlib Compression Tuning
3.4 High-Performance Storage
3.4.1 Hardware Tuning
3.4.2 Ceph Tuning
3.4.3 KAE MD5 Digest Algorithm Tuning
4 Ceph File Storage Tuning Guide
4.1 Introduction
4.1.1 Components
4.1.2 Environment
4.1.3 Tuning Guidelines and Process Flow
4.2 General-Purpose Storage
4.2.1 Hardware Tuning
4.2.2 System Tuning
4.2.3 Ceph Tuning
4.2.4 KAE zlib Compression Tuning
A Change History
1 Using the Kunpeng Hyper Tuner for Tuning
To tune the performance of components in the Kunpeng BoostKit for SDS, you can use the Kunpeng Hyper Tuner. When creating an analysis project, select Distributed Storage. For details, see Kunpeng Hyper Tuner.
2 Ceph Block Storage Tuning Guide
2.1 Introduction
2.2 General-Purpose Storage
2.3 High-Performance Storage
2.1 Introduction
2.1.1 Components
Ceph

Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization. Software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 2-1 shows the Ceph architecture.
Figure 2-1 Ceph architecture
Table 2-1 describes the Ceph modules and components.
Table 2-1 Module functions
Module Function
RADOS Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects irrespective of their data types. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.

OSD Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.

MON The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of a cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they coordinate with one another.
MGR The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.

Librados Librados is a method that simplifies access to RADOS. Currently, it supports the programming languages PHP, Ruby, Java, Python, C, and C++. It provides a local interface to RADOS, the core of the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.

RBD The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.

RGW The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).

MDS The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.

CephFS The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.
Vdbench
Vdbench is a command-line utility designed to help engineers and customers generate drive I/O loads for verifying storage performance and data integrity. Vdbench execution parameters can also be specified in text files.

Vdbench has many parameters. Table 2-2 lists some important common parameters.
Table 2-2 Common parameters
Parameter
Description
-f Specifies a script file for the pressure test.
-o Specifies the path for exporting a report. The default value is the current path.
lun Specifies the LUN device or file to be tested.
size Specifies the size of the LUN device or file to be tested.
rdpct Specifies the read percentage. The value 100 indicates full read, and the value 0 indicates full write.

seekpct Specifies the percentage of random data. The value 100 indicates all random data, and the value 0 indicates sequential data.
elapsed Specifies the duration of the current test.
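Putting the parameters in Table 2-2 together, a script file passed through the -f option might look like the following sketch. The device path /dev/rbd0 and all numeric values here are illustrative assumptions, not tuned recommendations:

```text
sd=sd1,lun=/dev/rbd0,openflags=o_direct
wd=wd1,sd=sd1,xfersize=4k,rdpct=70,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=600,interval=1
```

This defines a storage device (sd), a workload (wd) with 70% reads and fully random access, and a run (rd) lasting 600 seconds. It would be started with, for example, ./vdbench -f <script file> -o <report path>.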
2.1.2 Environment
Physical Networking

The physical environment of the Ceph block devices contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.
Figure 2-2 shows the physical network.
Figure 2-2 Physical networking
Hardware Configuration
Table 2-3 shows the Ceph hardware configuration.
Table 2-3 Hardware configuration
Server TaiShan 200 server (model 2280)
Processor Kunpeng 920 5230 processor
Core 2 x 32-core
CPU frequency 2600 MHz
Memory capacity 12 x 16 GB
Memory frequency 2666 MHz (8 Micron 2R memory modules)
NIC IN200 NIC (4 x 25GE)
Drive System drives: RAID 1 (2 x 960 GB SATA SSDs)
Data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs)

NVMe SSD Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD
Data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs
RAID controller card Avago SAS 3508
Software Versions
Table 2-4 lists the required software versions.
Table 2-4 Software versions
Software Version
OS CentOS Linux release 7.6.1810
openEuler 20.03 LTS SP1
Ceph 14.2.x Nautilus
ceph-deploy 2.0.1
Vdbench 5.04.06
Node Information
Table 2-5 describes the IP network segment planning of the hosts.
Table 2-5 Node information
Host Type Host Name Public Network Segment Cluster Network Segment
OSD/MON node Node 1 192.168.3.0/24 192.168.4.0/24
OSD/MGR node Node 2 192.168.3.0/24 192.168.4.0/24
OSD/MDS node Node 3 192.168.3.0/24 192.168.4.0/24
Component Deployment

Table 2-6 describes the deployment of service components in the Ceph block device cluster.
Table 2-6 Component deployment
Physical Machine Name OSD MON MGR
Node 1 12 OSDs 1 MON 1 MGR
Node 2 12 OSDs 1 MON 1 MGR
Node 3 12 OSDs 1 MON 1 MGR
Cluster Check

Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.
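The check above can be wrapped in a small guard script so that later tuning steps proceed only on a healthy cluster. This is a sketch that assumes the ceph CLI is installed and configured; if the CLI is unavailable, the script reports that instead of failing:

```shell
#!/bin/sh
# Query cluster health; fall back to a marker when the ceph CLI is missing.
health=$(ceph health 2>/dev/null) || health="UNAVAILABLE"

case "$health" in
    HEALTH_OK) echo "cluster healthy: safe to proceed" ;;
    *)         echo "check the cluster first: $health" ;;
esac
```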
2.1.3 Tuning Guidelines and Process Flow

The block storage tuning varies with the hardware configuration.
● General-purpose storage
HDDs are used as data drives, and solid state disks (SSDs) are configured as DB/WAL partitions and metadata storage pools.
● High-performance storage
All data drives are SSDs.
Perform the tuning based on your hardware configuration.
Tuning Guidelines

Performance optimization must comply with the following principles:
● When analyzing the performance, analyze the system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.
● Adjust only one performance parameter at a time.
● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.
Tuning Process Flow

The tuning analysis flow is as follows:
1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.
2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).
3. If the problem is caused by the servers, focus on the hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.
4. If all hardware indicators are normal, check the middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.
5. If all middleware indicators are normal, check the database indicators such as the slow query SQL indicators, hit ratio, locks, and parameter settings.
6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.
Table 2-7 lists the possible bottlenecks.
Table 2-7 Possible bottlenecks
Bottleneck Description

Hardware/Specifications Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).

Middleware Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.
Applications Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance: slow SQL statements and improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).

OS Problems related to the OS such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced. As a result, the response time is increased. This bottleneck is caused by the OS.

Network devices Problems related to devices such as firewalls, dynamic load balancers, and switches. Currently, more network access products are used in the cloud service architecture, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, it automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.
General tuning procedure:
Figure 2-3 shows the general tuning procedure.
Figure 2-3 General tuning procedure
2.2 General-Purpose Storage
2.2.1 Hardware Tuning
NVMe SSD Tuning

● Purpose
Reduce cross-chip data overheads.
● Procedure
Install the NVMe SSDs and NIC into the same riser card.
DIMM Installation Mode Tuning

● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.
2.2.2 System Tuning
Optimizing the OS Configuration

● Purpose
Adjust the system configuration to maximize the hardware performance.
● Procedure
Table 2-8 lists the optimization items.
Table 2-8 OS configuration parameters
Parameter Description Suggestion Configuration Method
vm.swappiness The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.

Default value: 60
Symptom: The performance deteriorates significantly when the swap partition is used.
Suggestion: Disable the swap partition and set this parameter to 0.

Run the following command:
sudo sysctl vm.swappiness=0
MTU Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.

Default value: 1500 bytes
Symptom: Run the ip addr command to view the value.
Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.

1. Run the following command:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
Add MTU="9000".
NOTE: ${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service.
service network restart
pid_max The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failures.

Default value: 32768
Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value.
Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.

Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max
file_max Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit for each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.

Default value: 13291808
Symptom: Run the cat /proc/sys/fs/file-max command to view the value.
Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed by the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command.

Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE: ${file-max} is the value displayed by the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command.
read_ahead Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.

Default value: 128 KB
Symptom: Readahead can effectively reduce the number of drive seeks and the I/O waiting time of the applications. Run /sbin/blockdev --getra /dev/sdb to view the value.
Suggestion: Change the value to 8192 KB. Improve the drive read efficiency by prefetching data into random access memory (RAM).

Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
I/O_Scheduler The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.

Default value: CFQ
Symptom: The Linux I/O scheduler needs to be configured based on the storage device type for optimal system performance.
Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.

Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives, writing noop instead of deadline for SSDs.
nr_requests If the Linux system receives a large number of read requests, the default number of request queue slots may be insufficient. To deal with this problem, you can dynamically adjust the default value in the /sys/block/<drive>/queue/nr_requests file.

Default value: 128
Symptom: Increase the drive throughput by adjusting the nr_requests parameter.
Suggestion: Set the number of drive request queue slots to 512.

Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
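After applying the settings in Table 2-8, the current values can be read back for verification with a short read-only script such as the following sketch. As above, /dev/sdb is an example data drive; repeat the per-drive checks for every data drive:

```shell
#!/bin/sh
# Read back the kernel tunables from Table 2-8 (read-only; changes nothing).
swappiness=$(cat /proc/sys/vm/swappiness)
pid_max=$(cat /proc/sys/kernel/pid_max)
file_max=$(cat /proc/sys/fs/file-max)
echo "vm.swappiness  = $swappiness"
echo "kernel.pid_max = $pid_max"
echo "fs.file-max    = $file_max"

# Per-drive settings; check the drive only if it is actually present.
if [ -b /dev/sdb ]; then
    echo "read_ahead  = $(/sbin/blockdev --getra /dev/sdb)"
    echo "scheduler   = $(cat /sys/block/sdb/queue/scheduler)"
    echo "nr_requests = $(cat /sys/block/sdb/queue/nr_requests)"
fi
```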
Optimizing the Network Performance

● Purpose
This test uses the 25GE Ethernet adapter (Hi1822) with four SFP+ ports. It is used as an example to describe how to optimize the NIC parameters for optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 2-9 describes the optimization items.
Table 2-9 NIC parameters
Parameter Description Suggestion
irqbalance System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.

Default value: active
Symptom: When this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs.
Suggestion:
● To disable irqbalance, set this parameter to inactive:
systemctl stop irqbalance
● Keep the function disabled after the server is restarted:
systemctl disable irqbalance
rx_buff Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage.

Default value: 2
Symptom: With the default value 2, interrupts consume a large number of CPU resources.
Suggestion: Load the driver with rx_buff set to 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer You can increase the throughput by adjusting the NIC buffer size.

Default value: 1024
Symptom: Run the ethtool -g <NIC name> command to view the value.
Suggestion: Change the ring_buffer queue size to 4096. For details, see the description following the table.
lro lro indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.

Default value: off
Symptom: After this function is enabled, the maximum throughput increases significantly.
Suggestion: Enable the large-receive-offload function to help networks improve the efficiency of sending and receiving packets. For details, see the description following the table.
hinicadm lro -i hinic0 -t <NUM> Received aggregated packets are sent after the time specified by <NUM> (in microseconds). You can set the value to 256 microseconds for better efficiency.

Default value: 16 microseconds
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 256 microseconds.
hinicadm lro -i hinic0 -n <NUM> Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.

Default value: 4
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 32.
– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following line to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload
NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to the cores:
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node.lscpu
4. Query the interrupt ID corresponding to the NIC.cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind the software interrupt to the core corresponding to the NUMA node.echo <core number> > /proc/irq/ <Interrupt ID> /smp_affinity_list
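Steps 1 to 5 above can be sketched as a single function. This is illustrative only: it assumes the interrupt names in /proc/interrupts contain the port name (as with the hinic driver), it must be run as root, and it should be verified on your system before use, so it is only defined here rather than called.

```shell
# Sketch: bind a NIC's software interrupts to the cores of its NUMA node,
# following steps 1-5 above. Assumes interrupt names contain the port name.
bind_nic_irqs() {
    port=$1
    systemctl stop irqbalance                                 # step 1
    node=$(cat "/sys/class/net/${port}/device/numa_node")     # step 2
    # step 3: core range of that NUMA node from lscpu, e.g. "0-23"
    cores=$(lscpu | awk -v n="node${node} " '$0 ~ n {print $NF}')
    # steps 4-5: bind every interrupt of this port to those cores
    for irq in $(grep "$port" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
        echo "$cores" > "/proc/irq/${irq}/smp_affinity_list"
    done
}
```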
2.2.3 Ceph Tuning
Modifying Ceph Configuration
● Purpose
Adjust the Ceph configuration to maximize system resource usage.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4, and then run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect. The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster. Table 2-10 describes the Ceph optimization items.
Table 2-10 Ceph parameter configuration
Parameter Description Suggestion
[global]
cluster_network You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
Recommended value: 192.168.4.0/24. You can set this parameter as required as long as it is different from the public network segment.
public_network Recommended value: 192.168.3.0/24. You can set this parameter as required as long as it is different from the cluster network segment.
osd_pool_default_size
Number of copies Recommended value: 3
osd_memory_target
Size of memory that each OSD process is allowed to obtain
Recommended value: 4294967296
For details about how to optimize other parameters, see Table 2-11.
Table 2-11 Other parameter configuration
Parameter Description Suggestion
[global]
osd_pool_default_min_size
Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0. Recommended value: 1
cluster_network You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
This parameter has no default value. Recommended value: 192.168.4.0/24
osd_memory_target
Size of memory that each OSD process is allowed to obtain
Default value: 4294967296. Recommended value: 4294967296
[mon]
mon_clock_drift_allowed
Clock drift allowed between MONs
Default value: 0.05. Recommended value: 1
mon_osd_min_down_reporters
Minimum number of down OSD reports that triggers a report to the MONs
Default value: 2. Recommended value: 13
mon_osd_down_out_interval
Number of seconds that Ceph waits before marking a down OSD as out
Default value: 600. Recommended value: 600
[OSD]
osd_journal_size OSD journal size Default value: 5120. Recommended value: 20000
osd_max_write_size
Maximum size (in MB) of data that can be written by an OSD at a time
Default value: 90. Recommended value: 512
osd_client_message_size_cap
Maximum size (in bytes) of data that can be stored in the memory by the clients
Default value: 100. Recommended value: 2147483648
osd_deep_scrub_stride
Number of bytes that can be read during deep scrubbing
Default value: 524288. Recommended value: 131072
osd_map_cache_size
Size of the cache (in MB) that stores the OSD map
Default value: 50. Recommended value: 1024
osd_recovery_op_priority
Restoration priority. The value ranges from 1 to 63. A larger value indicates higher resource usage.
Default value: 3. Recommended value: 2
osd_recovery_max_active
Number of active restoration requests in the same period
Default value: 3. Recommended value: 10
osd_max_backfills Maximum number of backfills allowed by an OSD
Default value: 1. Recommended value: 4
osd_min_pg_log_entries
Minimum number of reserved PG logs
Default value: 3000. Recommended value: 30000
osd_max_pg_log_entries
Maximum number of reserved PG logs
Default value: 3000. Recommended value: 100000
osd_mon_heartbeat_interval
Interval (in seconds) for an OSD to ping a MON
Default value: 30. Recommended value: 40
ms_dispatch_throttle_bytes
Maximum size (in bytes) of messages waiting to be dispatched
Default value: 104857600. Recommended value: 1048576000
objecter_inflight_ops
Allowed maximum number of unsent I/O requests. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited.
Default value: 1024. Recommended value: 819200
osd_op_log_threshold
Number of operation logs to be displayed at a time
Default value: 5. Recommended value: 50
osd_crush_chooseleaf_type
Bucket type when the CRUSH rule uses chooseleaf
Default value: 1. Recommended value: 0
journal_max_write_bytes
Maximum number of journal bytes that can be written at a time
Default value: 10485760. Recommended value: 1073741824
journal_max_write_entries
Maximum number of journal records that can be written at a time
Default value: 100. Recommended value: 10000
[Client]
rbd_cache RBD cache Default value: True. Recommended value: True
rbd_cache_size RBD cache size (in bytes)
Default value: 33554432. Recommended value: 335544320
rbd_cache_max_dirty
Maximum number of dirty bytes allowed when the cache is set to the writeback mode. If the value is 0, the cache is set to the writethrough mode.
Default value: 25165824. Recommended value: 134217728
rbd_cache_max_dirty_age
Duration (in seconds) for which the dirty data is stored in the cache before being flushed to the drives
Default value: 1. Recommended value: 30
rbd_cache_writethrough_until_flush
This parameter is used for compatibility with the virtio driver earlier than linux-2.6.32. It prevents the situation that data is written back when no flush request is sent. After this parameter is set, librbd processes I/Os in writethrough mode. The mode is switched to writeback only after the first flush request is received.
Default value: True. Recommended value: False
rbd_cache_max_dirty_object
Maximum number of objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image in units of 4 MB. Each chunk is abstracted as an object, and librbd manages the cache by object. You can increase the value of this parameter to improve the performance.
Default value: 0. Recommended value: 2
rbd_cache_target_dirty
Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
Default value: 16777216. Recommended value: 235544320
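Collecting the [global] recommendations from Table 2-10 and a few of the per-section values from Table 2-11 into one place, a ceph.conf fragment might look like the following. The network segments are the example values from the tables; adjust all values to your environment.

```
[global]
cluster_network = 192.168.4.0/24
public_network = 192.168.3.0/24
osd_pool_default_size = 3
osd_pool_default_min_size = 1
osd_memory_target = 4294967296

[osd]
osd_max_write_size = 512
osd_map_cache_size = 1024
osd_recovery_op_priority = 2
osd_max_backfills = 4

[client]
rbd_cache = true
rbd_cache_size = 335544320
```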
Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs created in a storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool. The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on/off command to enable or disable the Ceph balancer function. Table 2-12 describes the PG distribution parameters.
Table 2-12 PG distribution parameters
Parameter Description Suggestion
pg_num Total PGs = (Total_number_of_OSD x 100) / max_replication_count. Round up the result to the nearest integer power of 2.
Default value: 8. Symptom: A warning is displayed if the number of PGs is insufficient. Suggestion: Calculate the value based on the formula.
pgp_num Set the number of PGPs to be the same as that of PGs.
Default value: 8. Symptom: It is recommended that the number of PGPs be the same as the number of PGs. Suggestion: Calculate the value based on the formula.
ceph_balancer_mode
Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none. Symptom: If the number of PGs is unbalanced, some OSDs may be overloaded and become bottlenecks. Recommended value: upmap
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
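The pg_num formula in Table 2-12 can be checked with a small shell helper (a sketch; the function name is illustrative):

```shell
# Compute pg_num = (OSD count x 100) / replica count, rounded up to the
# nearest integer power of 2, as described in Table 2-12.
pg_num_for() {
    osds=$1
    replicas=$2
    target=$(( osds * 100 / replicas ))
    pg=1
    while [ "$pg" -lt "$target" ]; do pg=$(( pg * 2 )); done
    echo "$pg"
}

pg_num_for 12 3   # 12 OSDs, 3 replicas: target 400, rounded up to 512
```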
Binding OSDs to CPU Cores
● Purpose
Bind each OSD process to a fixed CPU core.
● Procedure
Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file. Table 2-13 describes the optimization items.
Table 2-13 OSD core binding parameters
Parameter Description Suggestion
[osd.n]
osd_numa_node Bind the osd.n daemon process to a specified idle NUMA node, which is a node other than the nodes that process the NIC software interrupts.
This parameter has no default value. Symptom: If the NUMA node of an OSD process is the same as that of the NIC interrupts, some CPUs may be overloaded. Suggestion: To balance the CPU load, avoid running OSD processes and the NIC interrupt handling (or other processes with high CPU usage) on the same NUMA node.
NOTE
● The Ceph OSD daemon process and NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.
● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU core of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU core bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/PortName/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupt from using the same CPU core.
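For example, if the NIC belongs to NUMA node 2 as in the note above, the per-OSD sections in ceph.conf could pin the daemons to the remaining nodes. The section names osd.0 and osd.1 are illustrative daemon names:

```
[osd.0]
osd_numa_node = 0

[osd.1]
osd_numa_node = 1
```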
Optimizing Compression Algorithm Configuration Parameters
● Purpose
Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.
● Procedure
The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after the compression algorithm is run. Set this parameter to a smaller value to maximize the compression rate of the compression algorithm. By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can cause a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm. The following table describes the compression parameters:
Parameter Description Suggestion
bluestore_min_alloc_size_hdd
Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine
Default value: 32768. Recommended value: 8192
osd_op_num_shards_hdd
Number of shards for an HDD data disk in an OSD process
Default value: 5. Recommended value: 12
osd_op_num_threads_per_shard_hdd
Average number of threads of an OSD process for each HDD data disk shard
Default value: 1. Recommended value: 2
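In ceph.conf form, the recommended values in the table above would be as follows (a sketch; apply the change on all storage nodes and restart the Ceph daemons for it to take effect, as described earlier):

```
[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2
```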
Enabling Bcache
Bcache is a block layer cache of the Linux kernel. It uses SSDs as the cache of HDDs for acceleration. To enable the Bcache kernel module, you need to recompile the kernel. For details, see the Bcache User Guide (CentOS 7.6).
Using the I/O Passthrough Tool
The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.
2.2.4 KAE zlib Compression Tuning
● Purpose
Optimize zlib compression to maximize the CPU capability of processing OSDs and maximize the hardware performance.
● Procedure
zlib compression is processed by the KAE.
Preparing the Environment
NOTE
Before installing the accelerator engine, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15
Download the acceleration engine installation package and developer guide.
Download link: https://github.com/kunpengcompute/KAE/tags
Installing the Acceleration Engine
NOTE
The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide. For details, see Installing the KAE Software Package Using Source Code.
Step 1 Install the acceleration engine according to the developer guide.
Step 2 Install the zlib library.
1. Download KAEzip.
2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.
3. Perform the compilation and installation.
cd KAEzip
sh setup.sh install
The zlib library is installed in /usr/local/kaezip.
Step 3 Back up the existing symbolic link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak
Step 4 Replace the zlib software compression algorithm dynamic library.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1
NOTE
In the cd command, /usr/local/kaezip indicates the KAEzip installation path. Change it as required.
----End
NOTE
If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes to restart the OSDs for the change to take effect after the dynamic library is replaced:
systemctl restart ceph-osd.target
Changing the Default Number of Accelerator Queues
NOTE
The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.
Step 1 Remove hisi_zip.
rmmod hisi_zip
Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
options hisi_zip uacce_mode=2 pf_q_num=512
Step 3 Load hisi_zip.
modprobe hisi_zip
Step 4 Check the hardware accelerator queue.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances
The change is successful if the displayed number of available instances reflects the new queue setting.
Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1
----End
Adapting Ceph to the Accelerator
NOTE
Currently, the mainline Ceph versions allow configuring the zlib compression mode using the configuration file. The released Ceph versions (up to v15.2.3) adopt the zlib compression mode without the data header and tail. However, the current hardware acceleration library supports only the mode with the data header and tail. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been incorporated into the mainline version: https://github.com/ceph/ceph/pull/34852
The following uses Ceph 14.2.11 as an example to describe how Ceph adapts to the zlib compression engine.
Step 1 Obtain the source code.
Source code download address: https://download.ceph.com/tarballs/
After the source code package is downloaded, save it to the /home directory on the server.
Step 2 Obtain the patch and save it to the /home directory.
https://github.com/kunpengcompute/ceph/releases/download/v14.2.11/ceph-14.2.11-glz.patch
Step 3 Go to the /home directory, decompress the source code package, and enter the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.11.tar.gz && cd ceph-14.2.11/
Step 4 Apply the patch in the root directory of the source code.
cd /home/ceph-14.2.11
patch -p1 < ceph-14.2.11-glz.patch
Step 5 After modifying the source code, compile Ceph.
● CentOS: See the Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See the Ceph 14.2.8 Porting Guide (openEuler 20.03).
Step 6 Install Ceph.
Step 7 Modify the ceph.conf file to configure the zlib compression mode.
vi /etc/ceph/ceph.conf
compressor_zlib_winsize=15
Step 8 Restart the Ceph cluster for the configuration to take effect, and then verify the setting:
ceph daemon osd.0 config show | grep compressor_zlib_winsize
----End
2.3 High-Performance Storage
2.3.1 Hardware Tuning
High-Performance Configuration Tuning
● Purpose
Balance the loads of the two CPUs.
● Procedure
Evenly distribute the NVMe SSDs and NICs to the two CPUs.
Hardware Type Optimization Item Description
NIC NUMA resource balancing
For example, you can insert the LOMs into the PCIe slots of CPU 1 and the Mellanox ConnectX-4 NICs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.
Storage NUMA resource balancing
For example, you can insert six NVMe SSDs into the PCIe slots of CPU 1 and the other six NVMe SSDs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.
2.3.2 System Tuning
Optimizing the OS Configuration
● Purpose
Adjust the system configuration to maximize the hardware performance.
● Procedure
Table 2-14 lists the optimization items.
Table 2-14 OS configuration parameters
Parameter Description Suggestion Configuration Method
vm.swappiness
The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60. Symptom: The performance deteriorates significantly when the swap partition is used. Suggestion: Disable the swap partition and set this parameter to 0.
Run the following command:
sudo sysctl vm.swappiness=0
MTU Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes. Symptom: Run the ip addr command to view the value. Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
1. Run the following command:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
Add MTU="9000".
NOTE
${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service.
service network restart
pid_max The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
Default value: 32768. Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value. Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max
file_max Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808. Symptom: Run the cat /proc/sys/fs/file-max command to view the value. Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE
${file-max} is the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
read_ahead Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent accesses to that area do not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB. Symptom: Readahead can effectively reduce the number of drive seeks and the I/O waiting time of the applications. Run the /sbin/blockdev --getra /dev/sdb command to view the value. Suggestion: Change the value to 8192 KB. Improve the drive read efficiency by pre-reading and recording the data to random access memory (RAM).
Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
I/O_Scheduler
The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ. Symptom: The Linux I/O scheduler needs to be configured based on different storage devices for the optimal system performance. Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
nr_requests If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queues in the /sys/block/hda/queue/nr_requests file.
Default value: 128. Symptom: Increase the drive throughput by adjusting the nr_requests parameter. Suggestion: Set the number of drive request queues to 512.
Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
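The echo and sysctl commands in Table 2-14 do not survive a reboot. To make the kernel parameters persistent, they can also be written to a sysctl configuration file (the file name below is illustrative; fs.file-max must be filled in with the MemTotal value from /proc/meminfo as described in the table):

```
# /etc/sysctl.d/99-sds-tuning.conf (hypothetical file name)
vm.swappiness = 0
kernel.pid_max = 4194303
# fs.file-max = <MemTotal in KB, as reported by /proc/meminfo>
```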
NUMA Affinity Tuning
● Purpose
Evenly allocate network and storage resources to NUMA nodes.
● Procedure
In this example, 12 NVMe SSDs and four network ports are evenly allocated to four NUMA nodes.
The NVMe SSD numbers range from 0 to 11, and the network port names are enps0f0, enps0f1, enps0f2, and enps0f3.
for i in {0..11}; do echo `expr ${i} / 3` > /sys/class/block/nvme${i}n1/device/device/numa_node; done
for j in {0..3}; do echo ${j} > /sys/class/net/enps0f${j}/device/numa_node; done
2.3.3 Ceph Tuning
Ceph Configuration Tuning
● Purpose
Adjust the Ceph configuration items to fully utilize the hardware performance of the system.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters.
For example, to change the number of copies to 4, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon process for the modification to take effect on the entire Ceph cluster.
Table 2-15 lists the optimization items.
Table 2-15 Ceph parameter configuration
Parameter Description Suggestion
[global]
osd_pool_default_min_size
Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0. Suggestion: Set this parameter to 1.
cluster_network You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
Default value: none. Suggestion: Set this parameter to 192.168.4.0/24.
osd_pool_default_size
Number of copies Default value: 3. Suggestion: Set this parameter to 3.
mon_max_pg_per_osd
PG alarm threshold. You can increase the value for better performance.
Default value: 250. Suggestion: Set this parameter to 3000.
mon_max_pool_pg_num
PG alarm threshold. You can increase the value for better performance.
Default value: 65536. Suggestion: Set this parameter to 300000.
debug_none Disable the debugging function to reduce the log printing overheads.
Suggestion: Set this parameter to 0/0.
debug_lockdep
debug_context
debug_crush
debug_mds
debug_mds_balancer
debug_mds_locker
debug_mds_log
debug_mds_log_expire
debug_mds_migrator
debug_buffer
debug_timer
debug_filer
debug_striper
debug_objecter
debug_rados
debug_rbd
debug_rbd_mirror
debug_rbd_replay
debug_journaler
debug_objectcacher
debug_client
debug_osd
debug_optracker
debug_objclass
debug_filestore
debug_journal
debug_ms
debug_mon
debug_monc
debug_paxos
debug_tp
debug_auth
debug_crypto
debug_finisher
debug_reserver
debug_heartbeatmap
debug_perfcounter
debug_rgw
debug_civetweb
debug_javaclient
debug_asok
debug_throttle
debug_refs
debug_xio
debug_compressor
debug_bluestore
debug_bluefs
debug_bdev
debug_kstore
debug_rocksdb
debug_leveldb
debug_memdb
debug_kinetic
debug_fuse
debug_mgr
debug_mgrc
debug_dpdk
debug_eventtrace
throttler_perf_counter
This function is enabled by default. You can check whether the threshold is a bottleneck. After the optimal performance is obtained, you are advised to disable the tracker, because the tracker affects the performance.
Default value: True. Suggestion: Set this parameter to False.
ms_dispatch_throttle_bytes
Maximum number of messages to be scheduled. You are advised to increase the value to improve the message processing efficiency.
Default value: 104857600. Suggestion: Set this parameter to 2097152000.
ms_bind_before_connect
Message queue binding, which ensures that traffic of multiple network ports is balanced.
Default value: False. Suggestion: Set this parameter to True.
[client]
rbd_cache Disable the client cache. After the function is disabled, the RBD cache is always in writethrough mode.
Default value: True. Suggestion: Set this parameter to False.
[osd]
osd_max_write_size
Maximum size (in MB) of data that can be written by an OSD at a time
Default value: 90. Suggestion: Set this parameter to 256.
osd_client_message_size_cap
Maximum size (in bytes) of data that can be stored in the memory by the clients
Default value: 524288000. Suggestion: Set this parameter to 1073741824.
osd_map_cache_size
Size of the cache (in MB) that stores the OSD map
Default value: 50. Suggestion: Set this parameter to 1024.
bluestore_rocksdb_options
RocksDB configuration parameter
Default value: compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_background_compactions=2
Suggestion: compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=16,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=8,flusher_threads=4,compaction_readahead_size=2MB
bluestore_csum_type
Checksum type. Setting it to none leaves the checksum type unspecified.
Default value: crc32c. Suggestion: none
mon_osd_full_ratio
Percentage of used drive space when an OSD is considered to be full. When the data volume exceeds this percentage, all read and write operations are stopped until the drive space is expanded or data is cleared so that the percentage of used drive space is less than the value.
Default value: 0.95. Suggestion: Set this parameter to 0.97.
mon_osd_nearfull_ratio
Percentage of used drive space when an OSD is regarded as almost used up. When the data volume exceeds this percentage, an alarm is generated indicating that the space is about to be used up.
Default value: 0.85. Suggestion: Set this parameter to 0.95.
osd_min_pg_log_entries
Lower limit of the number of PG logs
Default value: 3000. Suggestion: Set this parameter to 10.
osd_max_pg_log_entries
Upper limit of the number of PG logs
Default value: 3000. Suggestion: Set this parameter to 10.
bluestore_cache_meta_ratio
Ratio of the BlueStore cache allocated to metadata
Default value: 0.4. Suggestion: Set this parameter to 0.8.
bluestore_cache_kv_ratio
Ratio of the BlueStore cache allocated to key/value data
Default value: 0.4. Suggestion: Set this parameter to 0.2.
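Collecting the recommendations in Table 2-15 into ceph.conf form gives a fragment like the following. The debug_* settings (one per subsystem listed in the table) are abbreviated to a single example line; the values are the suggestions above and should be adjusted to your environment.

```
[global]
osd_pool_default_size = 3
osd_pool_default_min_size = 1
cluster_network = 192.168.4.0/24
mon_max_pg_per_osd = 3000
mon_max_pool_pg_num = 300000
debug_osd = 0/0
throttler_perf_counter = False
ms_dispatch_throttle_bytes = 2097152000
ms_bind_before_connect = True

[client]
rbd_cache = False

[osd]
osd_max_write_size = 256
osd_map_cache_size = 1024
bluestore_csum_type = none
mon_osd_full_ratio = 0.97
mon_osd_nearfull_ratio = 0.95
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
bluestore_cache_meta_ratio = 0.8
bluestore_cache_kv_ratio = 0.2
```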
Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs created in a storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool. The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on/off command to enable or disable the Ceph balancer function. Table 2-16 describes the PG distribution parameters.
Table 2-16 PG distribution parameters
Parameter Description Suggestion
pg_num Total PGs = (Total_number_of_OSDs x 100) / max_replication_count. Round up the result to the nearest power of 2.
Default value: 8
Symptom: A warning is displayed if the number of PGs is insufficient.
Suggestion: Calculate the value based on the formula.
pgp_num Set the number of PGPs to be the same as that of PGs.
Default value: 8
Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
Suggestion: Calculate the value based on the formula.
ceph_balancer_mode
Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none
Symptom: If the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks.
Recommended value: upmap
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs on each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
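The pg_num formula above can be scripted. The following is a minimal sketch; the calc_pg_num helper and the example OSD/replica counts are illustrative, not part of Ceph:

```shell
#!/bin/bash
# Compute pg_num = (Total_number_of_OSDs x 100) / max_replication_count,
# rounded up to the nearest power of 2, per the formula in Table 2-16.
calc_pg_num() {
    local osds=$1 replicas=$2
    local target=$(( osds * 100 / replicas ))
    local pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# Example: 36 OSDs with 3 replicas -> (36 x 100) / 3 = 1200 -> 2048.
calc_pg_num 36 3
```

The result can then be passed to ceph osd pool create or ceph osd pool set as shown in the procedure above.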
Binding OSDs to CPU Cores
● Purpose
Bind each OSD process to a fixed CPU core.
● Procedure
Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file.
Table 2-17 describes the optimization items.
Table 2-17 OSD core binding parameters
Parameter Description Suggestion
[osd.n]
osd_numa_node Bind the osd.n daemon process to a specified idle NUMA node, that is, a node other than the nodes that process the NIC software interrupts.
This parameter has no default value.
Symptom: If each OSD process runs on the same CPUs as the NIC interrupts, some CPUs may be overloaded.
Suggestion: To balance the CPU load, avoid running OSD processes and NIC interrupt processing (or other processes with high CPU usage) on the same NUMA node.
NOTE
● The Ceph OSD daemon process and NIC software interrupt process must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running each OSD process and NIC interrupt process (or other processes with high CPU usage) on the same NUMA node.
● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/PortName/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupts from using the same CPU cores.
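The node-selection logic described above can be sketched as a small helper. The port name hinic0 and the node counts are assumptions for illustration; on a real server, query the NIC's NUMA node first:

```shell
#!/bin/bash
# Pick a NUMA node for osd_numa_node that differs from the NIC's node,
# so OSD processes and NIC software interrupts do not share CPU cores.
pick_osd_numa_node() {
    local nic_node=$1 total_nodes=$2
    local n
    for (( n = 0; n < total_nodes; n++ )); do
        if [ "$n" -ne "$nic_node" ]; then
            echo "$n"      # first node that is not the NIC's node
            return 0
        fi
    done
    return 1
}

# On a live node, first run e.g.:
#   nic_node=$(cat /sys/class/net/hinic0/device/numa_node)
# Here we assume the NIC belongs to NUMA node 2 on a 4-node system.
pick_osd_numa_node 2 4
```

The returned node number can then be written as osd_numa_node under the corresponding [osd.n] section of ceph.conf.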
3 Ceph Object Storage Tuning Guide
3.1 Introduction
3.2 Cold Storage
3.3 General-Purpose Storage
3.4 High-Performance Storage
3.1 Introduction
3.1.1 Overview
Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization. Software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 3-1 shows the Ceph architecture.
Figure 3-1 Ceph architecture
Table 3-1 describes the Ceph modules and components.
Table 3-1 Module functions
Module Function
RADOS Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects, irrespective of their data types. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.
OSD Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.
MON The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of the cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they must handle the collaboration between them.
MGR The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.
Librados Librados is a method that simplifies access to RADOS. Currently, it supports the programming languages PHP, Ruby, Java, Python, C, and C++. It provides a local interface to RADOS, the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.
RBD The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.
RGW The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).
MDS The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.
CephFS The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.
3.1.2 Environment
Physical Networking
The physical environment of the Ceph block devices contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.
Figure 3-2 shows the physical network.
Figure 3-2 Physical networking
Hardware Configuration
Table 3-2 shows the Ceph hardware configuration.
Table 3-2 Hardware configuration
Server TaiShan 200 server (model 2280)
Processor Kunpeng 920 5230 processor
Core 2 x 32-core
CPU frequency 2600 MHz
Memory capacity 12 x 16 GB
Memory frequency 2666 MHz (8 Micron 2R memory modules)
NIC IN200 NIC (4 x 25GE)
Drive System drives: RAID 1 (2 x 960 GB SATA SSDs)
Data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs)
NVMe SSD Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD
Data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs
RAID controller card Avago SAS 3508
Software Versions
Table 3-3 lists the required software versions.
Table 3-3 Software versions
Software Version
OS CentOS Linux release 7.6.1810
openEuler 20.03 LTS SP1
Ceph 14.2.1 Nautilus
ceph-deploy 2.0.1
CosBench 0.4.2.c4
Node Information
Table 3-4 describes the IP network segment planning of the hosts.
Table 3-4 Node information
Host Type Host Name Public Network Segment Cluster Network Segment
OSD/MON node Node 1 192.168.3.0/24 192.168.4.0/24
OSD/MGR node Node 2 192.168.3.0/24 192.168.4.0/24
OSD/MDS node Node 3 192.168.3.0/24 192.168.4.0/24
Component Deployment
Table 3-5 describes the deployment of service components in the Ceph block device cluster.
Table 3-5 Component deployment
Physical Machine Name OSD MON MGR
Node 1 12 OSDs 1 MON 1 MGR
Node 2 12 OSDs 1 MON 1 MGR
Node 3 12 OSDs 1 MON 1 MGR
Cluster Check
Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.
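The health check can be wrapped in a small script that fails fast when the cluster is not healthy. This is a sketch that assumes the ceph CLI is installed on the node; the command to run is passed in as arguments so it can also be exercised without a live cluster:

```shell
#!/bin/bash
# Exit non-zero unless the given command reports HEALTH_OK.
check_ceph_health() {
    local status
    status=$("$@")       # e.g. check_ceph_health ceph health
    if [ "$status" = "HEALTH_OK" ]; then
        echo "cluster healthy"
        return 0
    fi
    echo "cluster not healthy: $status" >&2
    return 1
}

# Usage on a live cluster:
#   check_ceph_health ceph health
```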
3.1.3 Tuning Guidelines and Process Flow
The object storage tuning varies with the hardware configuration.
● Cold storage
All data drives are hard disk drives (HDDs). That is, the DB/WAL partitions and metadata storage pools use HDDs.
● General-purpose storage
HDDs are used as data drives, and SSDs are used as DB and WAL partitions and metadata storage pools.
● High-performance storage
All data drives are SSDs.
Perform the tuning based on your hardware configuration.
Tuning Guidelines
Performance optimization must comply with the following principles:
● When analyzing the performance, analyze the system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.
● Adjust only one performance parameter at a time.
● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.
Tuning Process Flow
The tuning analysis flow is as follows:
1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.
2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).
3. If the problem is caused by the servers, focus on hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.
4. If all hardware indicators are normal, check middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.
5. If all middleware indicators are normal, check database indicators such as slow SQL queries, hit ratio, locks, and parameter settings.
6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.
Table 3-6 lists the possible bottlenecks.
Table 3-6 Possible bottlenecks
Bottleneck Description
Hardware/Specifications Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).
Middleware Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if the parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.
Applications Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance: slow SQL statements, improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).
OS Problems related to the OS, such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced. As a result, the response time increases. This bottleneck is caused by the OS.
Network devices Problems related to devices such as firewalls, dynamic load balancers, and switches. Currently, more network access products are used in the cloud service architecture, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, the dynamic load balancer automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.
General tuning procedure:
Figure 3-3 shows the general tuning procedure.
Figure 3-3 General tuning procedure
3.2 Cold Storage
3.2.1 Hardware Tuning
DIMM Installation Mode Tuning
● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.
3.2.2 System Tuning
Optimizing the Network Performance
● Purpose
This test uses the 25GE Ethernet adapter (Hi1822) with four SFP+ ports. It is used as an example to describe how to optimize the NIC parameters for optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 3-7 describes the optimization items.
Table 3-7 NIC parameters
Parameter Description Suggestion
irqbalance System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.
Default value: active
Symptom: When this function is enabled, the system automatically allocates NIC software interrupts to idle CPUs.
Suggestion:
● To disable irqbalance, set this parameter to inactive:
systemctl stop irqbalance
● Keep the function disabled after the server is restarted:
systemctl disable irqbalance
rx_buff Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage.
Default value: 2
Symptom: With the default value 2, interrupts consume a large number of CPU resources.
Suggestion: Load the rx_buff parameter with the value 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer You can increase the throughput by adjusting the NIC buffer size.
Default value: 1024
Symptom: Run the ethtool -g <NIC name> command to view the value.
Suggestion: Change the ring_buffer queue size to 4096. For details, see the description following the table.
lro lro indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.
Default value: off
Symptom: After this function is enabled, the maximum throughput increases significantly.
Suggestion: Enable the large-receive-offload function to help networks improve the efficiency of sending and receiving packets. For details, see the description following the table.
hinicadm lro -i hinic0 -t <NUM>
Received aggregated packets are sent after the time specified by <NUM> (in microseconds). You can set the value to 256 microseconds for better efficiency.
Default value: 16 microseconds
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 256 microseconds.
hinicadm lro -i hinic0 -n <NUM>
Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.
Default value: 4
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 32.
– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following information to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload
NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to cores.
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node.
lscpu
4. Query the interrupt IDs corresponding to the NIC.
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind the software interrupts to the cores corresponding to the NUMA node.
echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list
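The steps above can be combined into one script. The sketch below prints the binding commands as a dry run (pipe the output to a root shell to apply them); the port name hinic0 and the core range are assumptions for illustration:

```shell
#!/bin/bash
# Print (dry run) the commands that bind every software interrupt of a NIC
# to a CPU core list. The optional third argument allows reading a copy of
# /proc/interrupts, which eases testing.
plan_irq_binding() {
    local port=$1 cores=$2 interrupts_file=${3:-/proc/interrupts} irq
    echo "systemctl stop irqbalance"      # step 1: disable irqbalance
    grep "$port" "$interrupts_file" | awk -F ':' '{print $1}' | tr -d ' ' |
    while read -r irq; do
        # step 5: bind each interrupt to the NUMA node's cores
        echo "echo $cores > /proc/irq/$irq/smp_affinity_list"
    done
}

# Usage on a live server, after steps 2 and 3 identify the NIC's NUMA node
# (e.g. node 2) and its cores (e.g. 32-47):
#   plan_irq_binding hinic0 32-47 | sudo sh
```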
Enabling SMMU Passthrough
● Purpose
To maximize the performance of the Kunpeng processor, you are advised to enable SMMU passthrough.
● Procedure
Step 1 Edit the /etc/grub2-efi.cfg file.
vi /etc/grub2-efi.cfg
Step 2 Find the line where vmlinuz-4.14.0-115.el7a.0.1.aarch64 is located in the kernel code, add iommu.passthrough=1 to the end of the line, save the file and exit, and restart the server.
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
else
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
fi
linux /vmlinuz-4.14.0-115.el7a.0.1.aarch64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap LANG=en_US.UTF-8 iommu.passthrough=1
initrd /initramfs-4.14.0-115.el7a.0.1.aarch64.img
----End
NOTE
This tuning procedure applies only to the Kunpeng computing platform.
4.14.0-115.el7a.0.1.aarch64 is the kernel version of CentOS 7.6. If you use another OS, run the uname -r command to query the current kernel version, and add iommu.passthrough=1 at the end of the line where vmlinuz-<kernel version> is located.
Optimizing the OS Configuration
● Purpose
Adjust the system configuration to maximize the hardware performance.
● Procedure
Table 3-8 lists the optimization items.
Table 3-8 OS configuration parameters
Parameter Description Suggestion Configuration Method
vm.swappiness
The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
Default value: 60
Symptom: The performance deteriorates significantly when the swap partition is used.
Suggestion: Disable the swap partition and set this parameter to 0.
Run the following command:
sudo sysctl vm.swappiness=0
MTU Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
Default value: 1500 bytes
Symptom: Run the ip addr command to view the value.
Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
1. Run the following command:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
Add MTU="9000".
NOTE
${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service.
service network restart
pid_max The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failures.
Default value: 32768
Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value.
Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max
file_max Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit for each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808
Symptom: Run the cat /proc/sys/fs/file-max command to view the value.
Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE
${file-max} is the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
read_ahead Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB
Symptom: Readahead can effectively reduce the number of drive seeks and the I/O waiting time of the applications. Run the /sbin/blockdev --getra /dev/sdb command to view the value.
Suggestion: Change the value to 8192 KB. Improve the drive read efficiency by pre-reading and recording the data to random access memory (RAM).
Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
I/O_Scheduler
The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ
Symptom: The Linux I/O scheduler needs to be configured based on the storage devices in use for optimal system performance.
Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
nr_requests If the Linux system receives a large number of read requests, the default number of request queues may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queues in the /sys/block/hda/queue/nr_requests file.
Default value: 128
Symptom: Increase the drive throughput by adjusting the nr_requests parameter.
Suggestion: Set the number of drive request queues to 512.
Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE
/dev/sdb is used as an example. You need to modify this parameter for all data drives.
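The OS tuning items in Table 3-8 can be applied with one helper per data drive. This is a sketch that must run as root on the target server; sdb is an example device, and the values mirror the table's suggestions:

```shell
#!/bin/bash
# file-max suggestion from Table 3-8: the MemTotal value (in KB) reported
# by /proc/meminfo. The optional argument eases testing with a sample file.
mem_total_kb() {
    awk '/MemTotal/ {print $2}' "${1:-/proc/meminfo}"
}

# Apply the table's suggested values to one data drive (run as root).
tune_os_for_drive() {
    local dev=$1
    sysctl vm.swappiness=0                               # avoid swapping
    echo 4194303 > /proc/sys/kernel/pid_max              # raise thread limit
    mem_total_kb > /proc/sys/fs/file-max                 # file-max = MemTotal
    /sbin/blockdev --setra 8192 "/dev/$dev"              # readahead 8192
    echo deadline > "/sys/block/$dev/queue/scheduler"    # deadline for HDDs
    echo 512 > "/sys/block/$dev/queue/nr_requests"       # deeper request queue
}

# Usage (as root): tune_os_for_drive sdb   # repeat for every data drive
```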
3.2.3 Ceph Tuning
Tuning Ceph Configuration
● Purpose
Modify the Ceph configuration to maximize system resource utilization.
● Procedure
You can modify the Ceph configuration in the /etc/ceph/ceph.conf file. For example, to change the number of copies to 4, add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect. The setting takes effect only for the current Ceph node. To apply the settings to the entire Ceph cluster, you need to modify the ceph.conf file of each Ceph node and restart the Ceph daemon processes. Table 3-9 describes the Ceph parameters to be modified.
Table 3-9 Ceph parameters
Parameter Description Suggestion
[global]
cluster_network Configures a network segment different from the public network. This network segment is used for replication and data balancing between OSDs to relieve the pressure on the public network.
Configure a network segment that is different from the public network segment and set the value to, for example, 192.168.4.0/24.
public_network Configure a network segment that is different from the cluster network segment and set the value to, for example, 192.168.3.0/24.
Table 3-10 describes other parameters that can be modified.
Table 3-10 Other parameters
Parameter Description Suggestion
[global]
osd_pool_default_min_size
Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0
Recommended value: 1
cluster_network Configures a network segment different from the public network. This network segment is used for replication and data balancing between OSDs to relieve the pressure on the public network.
Recommended value: 192.168.4.0/24
osd_pool_default_size
Specifies the number of copies.
Default value: 3
Recommended value: 3
osd_memory_target
Specifies the size of memory that each OSD process is allowed to obtain.
Default value: 4294967296
Recommended value: 4294967296
[mon]
mon_clock_drift_allowed
Specifies the clock drift allowed between MONs.
Default value: 0.05
Recommended value: 1
mon_osd_min_down_reporters
Specifies the minimum number of down OSDs that triggers a report to the MONs.
Default value: 2
Recommended value: 13
mon_osd_down_out_interval
Specifies the duration (in seconds) for which Ceph waits before an OSD is marked as down or out.
Default value: 600
Recommended value: 600
[OSD]
osd_journal_size Specifies the OSD journal size.
Default value: 5120
Recommended value: 20000
osd_max_write_size
Specifies the maximum size (in MB) of data that can be written by an OSD at a time.
Default value: 90
Recommended value: 512
osd_client_message_size_cap
Specifies the maximum size (in bytes) of data that can be stored in the memory by the clients.
Default value: 100
Recommended value: 2147483648
osd_deep_scrub_stride
Specifies the number of bytes that can be read during deep scrubbing.
Default value: 524288
Recommended value: 131072
osd_map_cache_size
Specifies the size of the cache (in MB) that stores the OSD map.
Default value: 50
Recommended value: 1024
osd_recovery_op_priority
Specifies the priority of the restoration operation. The value ranges from 1 to 63. A larger value indicates higher resource usage.
Default value: 3
Recommended value: 2
osd_recovery_max_active
Specifies the maximum number of active restoration requests allowed at the same time.
Default value: 3
Recommended value: 10
osd_max_backfills Specifies the maximum number of backfills allowed by an OSD.
Default value: 1
Recommended value: 4
osd_min_pg_log_entries
Specifies the maximum number of PGLogs that can be recorded when the PG is normal.
Default value: 3000
Recommended value: 30000
osd_max_pg_log_entries
Specifies the maximum number of PGLogs that can be recorded when the PG is degraded.
Default value: 3000
Recommended value: 100000
osd_mon_heartbeat_interval
Specifies the interval (in seconds) for an OSD to ping a MON.
Default value: 30
Recommended value: 40
ms_dispatch_throttle_bytes
Specifies the maximum number of messages to be dispatched.
Default value: 10485760
Recommended value: 1048576000
objecter_inflight_ops
Specifies the maximum number of unsent I/O requests allowed. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited.
Default value: 1024
Recommended value: 819200
osd_op_log_threshold
Specifies the number of operation logs displayed at a time.
Default value: 5
Recommended value: 50
osd_crush_chooseleaf_type
Specifies the bucket type when the CRUSH rule uses chooseleaf.
Default value: 1
Recommended value: 0
journal_max_write_bytes
Specifies the maximum number of bytes that can be written to a journal at a time.
Default value: 1048560
Recommended value: 1073714824
journal_max_write_entries
Specifies the maximum number of records that can be written to a journal at a time.
Default value: 100
Recommended value: 10000
[Client]
rbd_cache Specifies the RBD cache.
Default value: True (indicating that the RBD cache is enabled)
Recommended value: True
rbd_cache_size Specifies the RBD cache size (in bytes).
Default value: 33554432
Recommended value: 335544320
rbd_cache_max_dirty
Specifies the maximum number of dirty bytes allowed when the cache is in writeback mode. If the value is 0, the cache works in writethrough mode.
Default value: 25165824
Recommended value: 134217728
rbd_cache_max_dirty_age
Specifies the duration (in seconds) for which dirty data is stored in the cache before being flushed to the drives.
Default value: 1
Recommended value: 30
rbd_cache_writethrough_until_flush
This parameter is used to ensure compatibility with the VirtIO driver earlier than Linux 2.6.32. It allows data to be written back when no flush request has been sent. If this parameter is set to True, librbd processes I/Os in writethrough mode and switches to writeback mode only when the first flush request is received.
Default value: True
Recommended value: False

rbd_cache_max_dirty_object
Specifies the maximum number of cached objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, and each chunk is abstracted as an object that librbd manages in the cache. You can increase the value of this parameter to improve the performance.
Default value: 0
Recommended value: 2

rbd_cache_target_dirty
Specifies the dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
Default value: 16777216
Recommended value: 235544320
Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable it.
Table 3-11 describes the PG distribution parameters.
Table 3-11 PG distribution parameters
Parameter Description Suggestion
pg_num
Total PGs = (Total_number_of_OSDs x 100) / max_replication_count. Round the result up to the nearest power of 2.
Default value: 8
Symptom: A warning is displayed if the number of PGs is insufficient.
Suggestion: Calculate the value based on the formula.

pgp_num
Set the number of PGPs to be the same as that of PGs.
Default value: 8
Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
Suggestion: Calculate the value based on the formula.

ceph_balancer_mode
Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none
Symptom: If the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks.
Recommended value: upmap
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
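As a worked example of the pg_num formula, the following shell sketch computes the value for a hypothetical cluster. The OSD and replica counts are illustrative only; substitute your own cluster's figures:

```shell
#!/bin/sh
# Illustrative pg_num calculation: (OSD count x 100) / replica count,
# rounded UP to the nearest power of two. The inputs are examples.
osd_count=36
replicas=3

raw=$(( osd_count * 100 / replicas ))   # 3600 / 3 = 1200
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do      # double until >= raw
  pg_num=$(( pg_num * 2 ))
done
echo "raw=$raw pg_num=$pg_num"          # 1200 rounds up to 2048
```

The resulting pg_num (2048 here) would then be passed to ceph osd pool create or ceph osd pool set as described above.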
Binding OSDs and RGWs to CPU Cores
● Purpose
Bind the OSD and RGW processes to fixed CPU cores to prevent certain CPU cores from being overloaded.
● Procedure
When NIC software interrupts and Ceph processes share CPUs under heavy network load, certain CPUs may be overloaded and become bottlenecks, compromising the Ceph cluster performance. To solve the problem, bind the NIC software interrupts and Ceph processes to different CPU cores. Table 3-12 describes the parameters to be modified.
Table 3-12 Binding OSDs and RGWs to CPU cores
Parameter Description Suggestion
osd.[N]
Binds the osd.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none
Suggestion: Bind the osd.N daemon process to specified CPU cores that do not process NIC software interrupts to prevent the CPUs from becoming bottlenecks.

rgw.[N]
Binds the rgw.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none
Suggestion: Bind the rgw.N daemon process to CPU cores that do not process NIC software interrupts to prevent the CPUs from becoming bottlenecks.
NOTE
The Ceph OSD/RGW daemon processes and NIC software interrupt processing must run on different CPU cores. Otherwise, CPU bottlenecks may occur when the network load is heavy.
Run the following commands on all Ceph nodes to bind the CPU cores:
for i in `ps -ef | grep rgw | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
for i in `ps -ef | grep osd | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
NOTE
Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port Name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupts from using the same CPU cores. The core binding of the RGW is similar to that of the OSD. After finding an idle NUMA node, run the lscpu command to query the IDs of the CPU cores that correspond to the NUMA node. In the preceding command lines, 4-47 indicates the idle CPU cores of the node. Change the value as required.
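The taskset loops above can be wrapped in a small dry-run helper that only prints the commands it would run, so the chosen core range can be reviewed before applying it. The helper below is a sketch; the process pattern, PID, and core range in the example are placeholders, not values from this guide:

```shell
#!/bin/sh
# Dry-run sketch: print, rather than execute, the taskset command for
# every PID whose ps -ef line matches a pattern (e.g. osd or rgw).
# Pipe real `ps -ef` output in; the 4-47 core range is only an example.
print_taskset_cmds() {
  pattern="$1"
  cores="$2"
  grep "$pattern" | grep -v grep | awk '{print $2}' |
  while read -r pid; do
    echo "taskset -pc $cores $pid"
  done
}

# Example with a canned ps-style line for a hypothetical ceph-osd process:
printf 'ceph  1234  1  0 10:00 ?  00:00:01 /usr/bin/ceph-osd -i 0\n' |
  print_taskset_cmds osd 4-47
```

Once the printed commands look right, feed real `ps -ef` output in and pipe the result to sh to apply the binding.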
3.3 General-Purpose Storage
3.3.1 Hardware Tuning
NVMe SSD Tuning
● Purpose
Reduce cross-chip data overheads.
● Procedure
Install the NVMe SSDs and the NIC on the same riser card.

DIMM Installation Mode Tuning
● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate the DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.
3.3.2 System Tuning
Optimizing the Network Performance
● Purpose
This test uses the 25GE Ethernet adapter (Hi1822) with four SFP+ ports as an example to describe how to optimize the NIC parameters for the optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to CPU cores local to the NIC). Table 3-13 describes the optimization items.
Table 3-13 NIC parameters
Parameter Description Suggestion
irqbalance
System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.
Default value: active
Symptom: When this service is enabled, the system automatically allocates NIC software interrupts to idle CPUs.
Suggestion:
● Disable irqbalance:
systemctl stop irqbalance
● Keep the service disabled after the server is restarted:
systemctl disable irqbalance

rx_buff
Aggregation of large network packets requires multiple discontiguous memory pages and causes low memory usage. You can increase the value of this parameter to improve the memory usage.
Default value: 2
Symptom: With the default value 2, interrupts consume a large number of CPU resources.
Suggestion: Load the driver with rx_buff set to 8 to reduce discontiguous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer
You can increase the throughput by adjusting the NIC buffer size.
Default value: 1024
Symptom: Run the ethtool -g <NIC name> command to view the value.
Suggestion: Change the ring buffer queue size to 4096. For details, see the description following the table.

lro
lro indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.
Default value: off
Symptom: After this function is enabled, the maximum throughput increases significantly.
Suggestion: Enable the large-receive-offload function to improve the efficiency of sending and receiving packets. For details, see the description following the table.

hinicadm lro -i hinic0 -t <NUM>
Received aggregated packets are sent after the time specified by <NUM> (in microseconds) elapses. You can set the value to 256 microseconds for better efficiency.
Default value: 16 microseconds
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 256 microseconds.

hinicadm lro -i hinic0 -n <NUM>
Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.
Default value: 4
Symptom: This parameter is used with the LRO function.
Suggestion: Change the value to 32.
– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following information to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload
NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to the cores.
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node:
lscpu
4. Query the interrupt IDs corresponding to the NIC:
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind the software interrupts to the cores corresponding to the NUMA node:
echo <Core number> > /proc/irq/<Interrupt ID>/smp_affinity_list
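Steps 4 and 5 above can be combined into a dry-run helper that, given /proc/interrupts-style text on stdin, prints the echo commands that would bind each matching interrupt. This is a sketch for review before running as root; the port name hinic0 and core list 0-3 in the example are placeholders:

```shell
#!/bin/sh
# Dry-run sketch: for each interrupt line matching the given port name,
# print the command that would bind it to the chosen core list.
irq_bind_cmds() {
  port="$1"
  cores="$2"
  grep "$port" | awk -F ':' '{gsub(/ /,"",$1); print $1}' |
  while read -r irq; do
    echo "echo $cores > /proc/irq/$irq/smp_affinity_list"
  done
}

# Example with canned input (two queues of a hypothetical port hinic0);
# in real use, pipe `cat /proc/interrupts` in instead:
printf ' 91: 0 0 hinic0-rx0\n 92: 0 0 hinic0-tx0\n' | irq_bind_cmds hinic0 0-3
```

After reviewing the printed commands, run them as root (or pipe the output to sh) to apply the binding.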
Enabling SMMU Passthrough
● Purpose
To maximize the performance of the Kunpeng processor, you are advised to enable SMMU passthrough.
● Procedure
Step 1 Edit the /etc/grub2-efi.cfg file.
vi /etc/grub2-efi.cfg
Step 2 Find the line containing vmlinuz-4.14.0-115.el7a.0.1.aarch64 in the kernel code, add iommu.passthrough=1 to the end of the line, save the file and exit, and restart the server.
if [ x$feature_platform_search_hint = xy ]; then
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
else
search --no-floppy --fs-uuid --set=root e343026a-d245-4812-95ce-6ff999b3571c
fi
linux /vmlinuz-4.14.0-115.el7a.0.1.aarch64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap LANG=en_US.UTF-8 iommu.passthrough=1
initrd /initramfs-4.14.0-115.el7a.0.1.aarch64.img
----End
NOTE
This tuning procedure applies only to the Kunpeng computing platform.
4.14.0-115.el7a.0.1.aarch64 is the kernel version of CentOS 7.6. If you use another OS, run the uname -r command to query the current kernel version, and add iommu.passthrough=1 at the end of the line containing vmlinuz-<Kernel version>.
Optimizing the OS Configuration
● Purpose
Adjust the system configuration to maximize the hardware performance.
● Procedure
Table 3-14 lists the optimization items.
Table 3-14 OS configuration parameters
Parameter Description Suggestion Configuration Method
vm.swappiness
The swap partition is the virtual memory of the system. Do not use the swap partition because it deteriorates system performance.
Default value: 60
Symptom: The performance deteriorates significantly when the swap partition is used.
Suggestion: Disable the use of the swap partition by setting this parameter to 0.
Configuration method: Run the following command:
sudo sysctl vm.swappiness=0

MTU
Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets is reduced and the efficiency is improved.
Default value: 1500 bytes
Symptom: Run the ip addr command to view the value.
Suggestion: Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
Configuration method:
1. Edit the NIC configuration file:
vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
Add MTU="9000".
NOTE: ${Interface} indicates the network port name.
2. After the configuration is complete, restart the network service:
service network restart
pid_max
The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failures.
Default value: 32768
Symptom: Run the cat /proc/sys/kernel/pid_max command to view the value.
Suggestion: Set the maximum number of threads that can be generated in the system to 4194303.
Configuration method: Run the following command:
echo 4194303 > /proc/sys/kernel/pid_max

file_max
Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit for each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
Default value: 13291808
Symptom: Run the cat /proc/sys/fs/file-max command to view the value.
Suggestion: Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
Configuration method: Run the following command:
echo ${file-max} > /proc/sys/fs/file-max
NOTE: ${file-max} is the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
read_ahead
Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache, so that subsequent access to that area does not block on page faults. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
Default value: 128 KB
Symptom: Readahead can effectively reduce the number of drive seeks and the I/O waiting time of the applications. Run the /sbin/blockdev --getra /dev/sdb command to view the value.
Suggestion: Change the value to 8192 to improve the drive read efficiency by prefetching data into random access memory (RAM).
Configuration method: Run the following command:
/sbin/blockdev --setra 8192 /dev/sdb
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
I/O_Scheduler
The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ
Symptom: The Linux I/O scheduler needs to be configured based on the storage device type for the optimal system performance.
Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration method: Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.

nr_requests
If the Linux system receives a large number of read requests, the default number of request queue entries may be insufficient. To deal with this problem, you can dynamically adjust the default number of request queue entries in the /sys/block/hda/queue/nr_requests file.
Default value: 128
Symptom: Increase the drive throughput by adjusting the nr_requests parameter.
Suggestion: Set the number of drive request queue entries to 512.
Configuration method: Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
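The items in Table 3-14 can be collected into one pass. The sketch below prints the tuning commands instead of executing them, so they can be reviewed first and then run as root; the drive names sdb and sdc are examples only:

```shell
#!/bin/sh
# Dry-run sketch of the Table 3-14 settings: print each tuning command.
# Review the output, then run the commands as root to apply them.
print_os_tuning_cmds() {
  # file-max suggestion: the MemTotal value (in kB) from /proc/meminfo
  filemax=$(awk '/MemTotal/{print $2}' /proc/meminfo 2>/dev/null)
  echo "sysctl vm.swappiness=0"
  echo "echo 4194303 > /proc/sys/kernel/pid_max"
  [ -n "$filemax" ] && echo "echo $filemax > /proc/sys/fs/file-max"
  for d in sdb sdc; do                     # example data drives; adjust
    echo "blockdev --setra 8192 /dev/$d"   # readahead (512-byte sectors)
    echo "echo deadline > /sys/block/$d/queue/scheduler"
    echo "echo 512 > /sys/block/$d/queue/nr_requests"
  done
}

print_os_tuning_cmds
```

Note that deadline is printed for HDDs here; per the table, SSD data drives should use noop instead.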
3.3.3 Ceph Tuning

Modifying Ceph Configuration
● Purpose
Adjust the Ceph configuration to maximize system resource usage.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4, and then run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon processes for the modification to take effect on the entire Ceph cluster. Table 3-15 describes the Ceph optimization items.
Table 3-15 Ceph parameter configuration
Parameter Description Suggestion
[global]

cluster_network and public_network
You can configure a cluster network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
cluster_network: Recommended value: 192.168.4.0/24. Set it as required as long as it is different from the public network segment.
public_network: Recommended value: 192.168.3.0/24. Set it as required as long as it is different from the cluster network segment.

osd_pool_default_size
Number of copies
Recommended value: 3

osd_memory_target
Size of memory that each OSD process is allowed to obtain
Recommended value: 4294967296
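For reference, the [global] settings in Table 3-15 can be drafted as a ceph.conf fragment. The sketch below writes the fragment to a scratch file for review before merging it into /etc/ceph/ceph.conf on every node; the network segments are the example values from the table, not prescriptions:

```shell
#!/bin/sh
# Draft the Table 3-15 [global] settings into a scratch file for review.
conf=$(mktemp)
cat >> "$conf" <<'EOF'
[global]
public_network  = 192.168.3.0/24
cluster_network = 192.168.4.0/24
osd_pool_default_size = 3
osd_memory_target     = 4294967296
EOF
cat "$conf"    # review before merging into /etc/ceph/ceph.conf
```

After merging such settings into /etc/ceph/ceph.conf on all Ceph nodes, restart the daemons with systemctl restart ceph.target as described above.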
For details about how to optimize other parameters, see Table 3-16.
Table 3-16 Other parameter configuration
Parameter Description Suggestion
[global]

osd_pool_default_min_size
Minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0
Recommended value: 1
cluster_network
You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
This parameter has no default value.
Recommended value: 192.168.4.0/24

osd_memory_target
Size of memory that each OSD process is allowed to obtain
Default value: 4294967296
Recommended value: 4294967296

[mon]

mon_clock_drift_allowed
Clock drift allowed between MONs
Default value: 0.05
Recommended value: 1

mon_osd_min_down_reporters
Minimum number of down OSD reports that triggers a report to the MONs
Default value: 2
Recommended value: 13

mon_osd_down_out_interval
Number of seconds that Ceph waits before an OSD is marked as down or out
Default value: 600
Recommended value: 600

[OSD]

osd_journal_size
OSD journal size
Default value: 5120
Recommended value: 20000

osd_max_write_size
Maximum size (in MB) of data that can be written by an OSD at a time
Default value: 90
Recommended value: 512

osd_client_message_size_cap
Maximum size (in bytes) of client data that can be stored in the memory
Default value: 100
Recommended value: 2147483648

osd_deep_scrub_stride
Number of bytes that can be read during deep scrubbing
Default value: 524288
Recommended value: 131072

osd_map_cache_size
Size of the cache (in MB) that stores the OSD map
Default value: 50
Recommended value: 1024
osd_recovery_op_priority
Restoration priority. The value ranges from 1 to 63. A larger value indicates higher resource usage.
Default value: 3
Recommended value: 2

osd_recovery_max_active
Number of active restoration requests in the same period
Default value: 3
Recommended value: 10

osd_max_backfills
Maximum number of backfills allowed by an OSD
Default value: 1
Recommended value: 4

osd_min_pg_log_entries
Minimum number of reserved PG logs
Default value: 3000
Recommended value: 30000

osd_max_pg_log_entries
Maximum number of reserved PG logs
Default value: 3000
Recommended value: 100000

osd_mon_heartbeat_interval
Interval (in seconds) for an OSD to ping a MON
Default value: 30
Recommended value: 40

ms_dispatch_throttle_bytes
Maximum size (in bytes) of messages to be dispatched
Default value: 104857600
Recommended value: 1048576000

objecter_inflight_ops
Maximum number of unsent I/O requests allowed. This parameter is used for client traffic control. If the number of unsent I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number of unsent I/O requests is not limited.
Default value: 1024
Recommended value: 819200

osd_op_log_threshold
Number of operation logs to be displayed at a time
Default value: 5
Recommended value: 50

osd_crush_chooseleaf_type
Bucket type when the CRUSH rule uses chooseleaf
Default value: 1
Recommended value: 0
journal_max_write_bytes
Maximum number of journal bytes that can be written at a time
Default value: 10485760
Recommended value: 1073714824

journal_max_write_entries
Maximum number of journal records that can be written at a time
Default value: 100
Recommended value: 10000

[Client]

rbd_cache
RBD cache
Default value: True
Recommended value: True

rbd_cache_size
RBD cache size (in bytes)
Default value: 33554432
Recommended value: 335544320

rbd_cache_max_dirty
Maximum number of dirty bytes allowed when the cache is set to the writeback mode. If the value is 0, the cache is set to the writethrough mode.
Default value: 25165824
Recommended value: 134217728

rbd_cache_max_dirty_age
Duration (in seconds) for which the dirty data is stored in the cache before being flushed to the drives
Default value: 1
Recommended value: 30

rbd_cache_writethrough_until_flush
This parameter is used for compatibility with the VirtIO driver earlier than Linux 2.6.32. It prevents data from being written back when no flush request has been sent. After this parameter is set, librbd processes I/Os in writethrough mode, and switches to writeback mode only after the first flush request is received.
Default value: True
Recommended value: False
rbd_cache_max_dirty_object
Maximum number of cached objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, and each chunk is abstracted as an object that librbd manages in the cache. You can increase the value of this parameter to improve the performance.
Default value: 0
Recommended value: 2

rbd_cache_target_dirty
Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
Default value: 16777216
Recommended value: 235544320
Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on or ceph balancer off command to enable or disable it.
Table 3-17 describes the PG distribution parameters.
Table 3-17 PG distribution parameters
Parameter Description Suggestion
pg_num
Total PGs = (Total_number_of_OSDs x 100) / max_replication_count. Round the result up to the nearest power of 2.
Default value: 8
Symptom: A warning is displayed if the number of PGs is insufficient.
Suggestion: Calculate the value based on the formula.

pgp_num
Set the number of PGPs to be the same as that of PGs.
Default value: 8
Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
Suggestion: Calculate the value based on the formula.

ceph_balancer_mode
Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none
Symptom: If the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks.
Recommended value: upmap
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
Binding OSDs and RGWs to CPU Cores
● Purpose
Bind the OSD and RGW processes to fixed CPU cores to prevent certain CPU cores from being overloaded.
● Procedure
When NIC software interrupts and Ceph processes share CPUs under heavy network load, certain CPUs may be overloaded and become bottlenecks, compromising the Ceph cluster performance. To solve the problem, bind the NIC software interrupts and Ceph processes to different CPU cores. Table 3-18 describes the parameters to be modified.
Table 3-18 Binding OSDs and RGWs to CPU cores
Parameter Description Suggestion
osd.[N]
Binds the osd.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none
Suggestion: Bind the osd.N daemon process to specified CPU cores that do not process NIC software interrupts to prevent the CPUs from becoming bottlenecks.

rgw.[N]
Binds the rgw.N daemon process to a specified idle NUMA node, which is a node other than the nodes that process NIC software interrupts.
Default value: none
Suggestion: Bind the rgw.N daemon process to CPU cores that do not process NIC software interrupts to prevent the CPUs from becoming bottlenecks.
NOTE
The Ceph OSD/RGW daemon processes and NIC software interrupt processing must run on different CPU cores. Otherwise, CPU bottlenecks may occur when the network load is heavy.
Run the following commands on all Ceph nodes to bind the CPU cores:
for i in `ps -ef | grep rgw | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
for i in `ps -ef | grep osd | grep -v grep | awk '{print $2}'`; do taskset -pc 4-47 $i; done
NOTE
Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port Name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSD and NIC software interrupts from using the same CPU cores. The core binding of the RGW is similar to that of the OSD. After finding an idle NUMA node, run the lscpu command to query the IDs of the CPU cores that correspond to the NUMA node. In the preceding command lines, 4-47 indicates the idle CPU cores of the node. Change the value as required.
Optimizing Compression Algorithm Configuration Parameters
● Purpose
Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.
● Procedure
The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after the compression algorithm is run. Set this parameter to a smaller value to maximize the compression ratio of the compression algorithm.
By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm.
The following table describes the compression-related parameter configuration:
Parameter: bluestore_min_alloc_size_hdd
  Description: Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine.
  Default value: 32768. Recommended value: 8192.
Parameter: osd_op_num_shards_hdd
  Description: Number of shards for an HDD data disk in an OSD process.
  Default value: 5. Recommended value: 12.
Parameter: osd_op_num_threads_per_shard_hdd
  Description: Average number of threads of an OSD process for each HDD data disk shard.
  Default value: 1. Recommended value: 2.
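Collected in ceph.conf, the recommended values above might look as follows (a sketch; note that bluestore_min_alloc_size_hdd takes effect only when an OSD is created, so it must be set before deploying the OSDs):

```ini
[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2
```

After editing the file on every storage node, restart the OSDs (systemctl restart ceph-osd.target) for the shard and thread settings to take effect.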
Using the I/O Passthrough Tool for Optimization
The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.
3.3.4 KAE zlib Compression Tuning
● Purpose
Optimize zlib compression to free up CPU capacity for the OSD processes and maximize the hardware performance.
● Procedure
Enable the hardware acceleration engine to implement zlib compression.
Preparing the Environment
NOTE
Before installing the accelerator engine, you need to apply for and install a license.
License application guide:
https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide:
https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15
Download the acceleration engine installation package and developer guide.
Download link: https://github.com/kunpengcompute/KAE/tags
Installing the Acceleration Engine
NOTE
The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide.
For details, see Installing the KAE Software Package Using Source Code.
Step 1 Install the acceleration engine according to the developer guide.
Step 2 Install the zlib library.
1. Download KAEzip.
2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.
3. Perform the compilation and installation.
cd KAEzip
sh setup.sh install
The zlib library is installed in /usr/local/kaezip.
Step 3 Back up the existing library link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak
Step 4 Replace the zlib software compression algorithm dynamic library.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1
NOTE
In the cd /usr/local/kaezip/lib command, /usr/local/kaezip indicates the KAEzip installation path. Change it as required.
----End
NOTE
If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes after the replacement to restart the OSDs for the change to take effect:
systemctl restart ceph-osd.target
Changing the Default Number of Accelerator Queues
NOTE
The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.
Step 1 Remove hisi_zip.
rmmod hisi_zip
Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
options hisi_zip uacce_mode=2 pf_q_num=512
Step 3 Load hisi_zip.
modprobe hisi_zip
Step 4 Check the hardware accelerator queue.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances
The change is successful if the command output shows the new number of available queues.
Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1
----End
Adapting Ceph to the Accelerator
NOTE
Currently, the mainline Ceph versions allow configuring the zlib compression mode using the configuration file. The released Ceph versions (by version 15.2.3) adopt the zlib compression mode without data headers or trailers. However, the current hardware acceleration library supports only the mode with data headers and trailers. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been incorporated into the mainline versions.
https://github.com/ceph/ceph/pull/34852
The following uses Ceph 14.2.8 as an example to describe how to adapt Ceph to the zlib compression engine.
Step 1 Obtain the source code.
URL: https://download.ceph.com/tarballs/
After the source code package is downloaded, save it to the /home directory on the server.
Step 2 Obtain the patch and save it to the /home directory.
https://mirrors.huaweicloud.com/kunpeng/archive/kunpeng_solution/storage/Patch/
Step 3 Go to the /home directory, decompress the source code package, and go to the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.8.tar.gz && cd ceph-14.2.8/
Step 4 Apply the patch in the source code directory.
patch -p1 < ../ceph-14.2.8-zlib-compress.patch
Step 5 After modifying the source code, compile Ceph.
● CentOS: See Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See Ceph 14.2.8 Porting Guide (openEuler 20.03).
Step 6 Install Ceph.
Step 7 Modify the ceph.conf file to configure the zlib compression mode.
vi /etc/ceph/ceph.conf
compressor_zlib_winsize=15
Step 8 Modify the systemd permissions. In Ceph, the RGW process service is managed by systemd. To enable systemd to access hardware acceleration devices, you need to modify the configuration on each RGW node as follows:
1. Open the configuration file.
vi /usr/lib/systemd/system/[email protected]
Change PrivateDevices=yes to PrivateDevices=no.
2. Make the modification take effect.
systemctl daemon-reload
Step 9 Restart the Ceph cluster for the configuration to take effect. You can then check that the setting is active:
ceph daemon osd.0 config show | grep compressor_zlib_winsize
----End
3.4 High-Performance Storage
3.4.1 Hardware Tuning
High-performance configuration tuning
● Purpose
Balance the loads of the two CPUs.
● Procedure
Evenly distribute the NVMe SSDs and NICs to the two CPUs.
Hardware Type: NIC
  Tuning Method: NUMA resource balancing
  Remarks: For example, you can insert the LOMs into the PCIe slots of CPU 1 and the Mellanox ConnectX-4 NICs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.
Hardware Type: Storage
  Tuning Method: NUMA resource balancing
  Remarks: For example, you can insert six NVMe SSDs into the PCIe slots of CPU 1 and the other six NVMe SSDs into the PCIe slots of CPU 2 to balance the loads of the two CPUs.
3.4.2 Ceph Tuning
● Purpose
Adjust the Ceph configuration items to fully utilize the hardware performance of the system.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, to change the number of copies to 4, add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file and run the systemctl restart ceph.target command to restart the Ceph daemon process for the change to take effect. The setting takes effect only for the current Ceph node. To apply the settings across the entire Ceph cluster, you need to modify the ceph.conf file of each Ceph node and restart the Ceph daemon process. Table 3-19 lists the optimization items.
Table 3-19 Ceph parameters
[global]
Parameter: osd_pool_default_min_size
  Description: Specifies the minimum number of I/O copies that the PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
  Default value: 0. Recommended value: 1.
Parameter: cluster_network
  Description: Configures a network segment different from the public network. This network segment is used for replication and data balancing between OSDs to relieve the pressure on the public network.
  Default value: none. Recommended value: 192.168.4.0/24.
Parameter: osd_pool_default_size
  Description: Specifies the number of copies.
  Default value: 3. Recommended value: 3.
Parameter: mon_max_pg_per_osd
  Description: Indicates the PG alarm threshold. You can increase the value for better performance.
  Default value: 250. Recommended value: 3000.
[rgw]
Parameter: rgw_override_bucket_index_max_shards
  Description: Specifies the number of shards per bucket index. The value 0 indicates that no shard is available.
  Default value: 0. Recommended value: 8.
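As a sketch, the recommended values from Table 3-19 could be combined in /etc/ceph/ceph.conf as follows (the cluster_network segment is the example value from this guide's environment, and the section naming mirrors the table; adjust both to your own deployment):

```ini
[global]
osd_pool_default_min_size = 1
osd_pool_default_size = 3
mon_max_pg_per_osd = 3000
cluster_network = 192.168.4.0/24

[rgw]
rgw_override_bucket_index_max_shards = 8
```

Copy the file to every Ceph node and restart the daemons (systemctl restart ceph.target) for the settings to take effect.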
PG Distribution Tuning
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs created in a storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. You can run the ceph balancer on/off command to enable or disable the Ceph balancer function. Table 3-20 describes the PG distribution parameters.
Table 3-20 PG distribution parameters
Parameter: pg_num
  Description: Total PGs = (Total_number_of_OSD x 100)/max_replication_count. Round up the result to the nearest integer power of 2.
  Default value: 8. Symptom: A warning is displayed if PGs are insufficient. Suggestion: Calculate the value based on the formula.
Parameter: pgp_num
  Description: Sets the number of PGPs to be the same as that of PGs.
  Default value: 8. Symptom: It is recommended that the number of PGPs be the same as that of PGs. Suggestion: Calculate the value based on the formula.
Parameter: ceph_balancer_mode
  Description: Enables the balancer plug-in and sets the plug-in mode to upmap.
  Default value: none. Symptom: If PGs are not evenly distributed across OSDs, some OSDs may be overloaded and become bottlenecks. Suggestion: Set this parameter to upmap.
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically optimize the Ceph PG distribution. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs carried by each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
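The pg_num formula from Table 3-20 can be sketched in shell arithmetic. The OSD and replica counts below are example values (36 OSDs and 3 replicas, as in a three-node cluster with 12 OSDs per node):

```shell
osds=36          # total number of OSDs in the cluster (example value)
replicas=3       # max_replication_count
raw=$(( osds * 100 / replicas ))
# Round up to the nearest integer power of 2:
pg_num=1
while [ "$pg_num" -lt "$raw" ]; do
  pg_num=$(( pg_num * 2 ))
done
echo "$pg_num"   # prints 2048 for this example
# The result would then be used for both pg_num and pgp_num, e.g.:
#   ceph osd pool create mypool 2048 2048
```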
3.4.3 KAE MD5 Digest Algorithm Tuning
● Purpose
Optimize the MD5 calculation process when the RGW writes objects, freeing up CPU capacity for the RGW process and maximizing the hardware performance.
● Procedure
Enable the Kunpeng Accelerator Engine (KAE) to implement MD5 calculation.
Environment Preparations
NOTE
Before installing the KAE, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100048792/ba20dd15
Download the acceleration engine installation package and developer guide.
URL: https://github.com/kunpengcompute/KAE/releases/tag/v1.3.6-bata
Installing the Accelerator Engine
NOTE
● The developer guide describes how to install and use all modules of the accelerator engine. After reading the guide, select an appropriate installation mode.
● If CentOS 7.6 and OpenSSL 1.0.2k are used, you need to download the corresponding libkae engine software package. The installation method of the package is the same as that in the developer guide.
Install the accelerator engine as instructed by the developer guide.
Changing the Default Number of Accelerator Queues
NOTE
The default number of hardware accelerator queues is 256. When the number of accelerator queues remaining (which can be obtained in Step 4) is 0 or a small value during service running, you can change the number of queues to 512 or 1024 to maximize the accelerator performance.
Step 1 Uninstall hisi_sec2.
rmmod hisi_sec2
Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_sec2.conf
options hisi_sec2 uacce_mode=2 enable_sm4_ctr=1 pf_q_num=512
Step 3 Load hisi_sec2.
modprobe hisi_sec2
Step 4 Check the hardware accelerator queue.
cat /sys/class/uacce/hisi_sec2-*/attrs/available_instances
The change is successful if the command output shows the new number of available queues.
----End
Adapting Ceph to the Accelerator
NOTE
Currently, the Ceph mainline version supports the configuration of the OpenSSL external engine using the configuration file. However, this feature is not available in the released Ceph versions (by v15.2.3). To use MD5 hardware acceleration in these versions, you need to modify the Ceph source code. For details about the modification method, see the latest patch that has been incorporated into the mainline versions.
https://github.com/ceph/ceph/pull/33964/
The following uses Ceph v14.2.8 as an example to describe how to adapt Ceph to MD5 hardware acceleration.
Step 1 Obtain the source code.
URL: https://download.ceph.com/tarballs/
After the source code package is downloaded, save it to the /home directory on the server.
Step 2 Obtain the adaptation patch. Download the patch that enables the OpenSSL external engine of Ceph v14.2.8. After downloading the patch, save it to the /home directory.
Download the patch at https://mirrors.huaweicloud.com/kunpeng/archive/kunpeng_solution/storage/Patch/.
Step 3 Go to the /home directory, decompress the source code package, and go to the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.8.tar.gz && cd ceph-14.2.8/
Step 4 Apply the patch in the root directory of the source code.
patch -p1 < ../ceph-14.2.8-common-rgw-add-openssl-engine-support.patch
Step 5 After modifying the source code, compile Ceph.
● CentOS: See Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See Ceph 14.2.8 Porting Guide (openEuler 20.03).
Step 6 Install Ceph.
Step 7 Modify the systemd permissions. In Ceph, the RGW process service is managed by systemd. To enable systemd to access hardware acceleration devices, you need to modify the configuration on each RGW node as follows:
1. Open the configuration file.
vi /usr/lib/systemd/system/[email protected]
Change PrivateDevices=yes to PrivateDevices=no.
2. Make the modification take effect.
systemctl daemon-reload
Step 8 Use KAE to accelerate RGW digest computing. Add the following OpenSSL engine options to the global section in the ceph.conf file on the node where the RGW is deployed:
openssl_engine_opts = "engine_id=kae,dynamic_path=/usr/local/lib/engines-1.1/libkae.so,KAE_CMD_ENABLE_ASYNC=0,default_algorithms=DIGESTS,init=1"
dynamic_path is the default installation path of libkae.so in the libkae engine software package. default_algorithms=DIGESTS indicates that this engine is used for digest algorithms including MD5. After completing the configuration, synchronize the configuration to all nodes where the RGW is deployed.
Step 9 Restart the RGW for the settings to take effect. Run the following command on each node where the RGW is deployed:
systemctl restart ceph-radosgw.target
----End
4 Ceph File Storage Tuning Guide
4.1 Introduction
4.2 General-Purpose Storage
4.1 Introduction
4.1.1 Components
Ceph
Ceph is a distributed, scalable, reliable, and high-performance storage system platform that supports storage interfaces including block devices, file systems, and object gateways. The optimization methods described in this document include hardware optimization and software configuration optimization. Software code optimization is not involved. By adjusting the system and Ceph configuration parameters, Ceph can fully utilize the hardware performance of the system. Ceph Placement Group (PG) distribution optimization and object storage daemon (OSD) core binding aim to balance drive loads and prevent any OSD from becoming a bottleneck. In addition, in general-purpose storage scenarios, using NVMe SSDs as Bcache can also improve performance. Figure 4-1 shows the Ceph architecture.
Figure 4-1 Ceph architecture
Table 4-1 describes the Ceph modules and components.
Table 4-1 Module functions
Module Function
RADOS: Reliable Autonomic Distributed Object Store (RADOS) is the heart of a Ceph storage cluster. Everything in Ceph is stored by RADOS in the form of objects irrespective of their data types. The RADOS layer ensures data consistency and reliability through data replication, fault detection and recovery, and data recovery across cluster nodes.
OSD: Object storage daemons (OSDs) store the actual user data. Every OSD is usually bound to one physical drive. The OSDs handle the read/write requests from clients.
MON: The monitor (MON) is the most important component in a Ceph cluster. It manages the Ceph cluster and maintains the status of the entire cluster. The MON ensures that related components of a cluster can be synchronized at the same time. It functions as the leader of the cluster and is responsible for collecting, updating, and publishing cluster information. To avoid single points of failure (SPOFs), multiple MONs are deployed in a Ceph environment, and they must handle the collaboration between them.
MGR: The manager (MGR) is a monitoring system that provides collection, storage, analysis (including alarming), and visualization functions. It makes certain cluster parameters available for external systems.
Librados: Librados is a method that simplifies access to RADOS. Currently, it supports the programming languages PHP, Ruby, Java, Python, C, and C++. It provides a local interface to RADOS, the Ceph storage cluster, and is the base component of other services such as the RADOS block device (RBD) and RADOS gateway (RGW). In addition, it provides the Portable Operating System Interface (POSIX) for the Ceph file system (CephFS). The Librados API can be used to directly access RADOS, enabling developers to create their own interfaces for accessing the Ceph cluster storage.
RBD: The RADOS block device (RBD) is the Ceph block device that provides block storage for external systems. It can be mapped, formatted, and mounted like a drive to a server.
RGW: The RADOS gateway (RGW) is a Ceph object gateway that provides RESTful APIs compatible with S3 and Swift. The RGW also supports multi-tenancy and the OpenStack Identity service (Keystone).
MDS: The Ceph Metadata Server (MDS) tracks the file hierarchy and stores metadata used only for CephFS. The RBD and RGW do not require metadata. The MDS does not directly provide data services for clients.
CephFS: The CephFS provides a POSIX-compatible distributed file system of any size. It depends on the Ceph MDS to track the file hierarchy, namely the metadata.
Vdbench
Vdbench is a command line utility designed to help engineers and customers generate drive I/O loads for verifying storage performance and data integrity. You can also specify Vdbench execution parameters by entering text files.
Vdbench has many parameters. Table 4-2 lists some important common parameters.
Table 4-2 Common parameters
-f: Specifies a script file for the pressure test.
-o: Specifies the path for exporting a report. The default value is the current path.
lun: Specifies the LUN device or file to be tested.
size: Specifies the size of the LUN device or file to be tested.
rdpct: Specifies the read percentage. The value 100 indicates full read, and the value 0 indicates full write.
seekpct: Specifies the percentage of random data. The value 100 indicates all random data, and the value 0 indicates sequential data.
elapsed: Specifies the duration of the current test.
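A minimal Vdbench script tying these parameters together might look like this (the device path and load values are assumptions; adjust them for your environment):

```
* 4 KB random-read test against a raw device (hypothetical /dev/rbd0).
sd=sd1,lun=/dev/rbd0,size=100g
wd=wd1,sd=sd1,xfersize=4k,rdpct=100,seekpct=100
rd=run1,wd=wd1,iorate=max,elapsed=300,interval=5
```

The script would be run with ./vdbench -f <script file> -o <output directory>, matching the -f and -o options described above.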
4.1.2 Environment
Physical Networking
The physical environment of the Ceph block devices contains two network layers and three nodes. In the physical environment, the MON, MGR, MDS, and OSD nodes are deployed together. At the network layer, the public network is separated from the cluster network. The two networks use 25GE optical ports for communication.
Figure 4-2 shows the physical network.
Figure 4-2 Physical networking
Hardware Configuration
Table 4-3 shows the Ceph hardware configuration.
Table 4-3 Hardware configuration
Server TaiShan 200 server (model 2280)
Processor Kunpeng 920 5230 processor
Core 2 x 32-core
CPU frequency 2600 MHz
Memory capacity 12 x 16 GB
Memory frequency 2666 MHz (8 Micron 2R memory modules)
NIC IN200 NIC (4 x 25GE)
Drive System drives: RAID 1 (2 x 960 GB SATA SSDs). Data drives of general-purpose storage: JBOD enabled in RAID mode (12 x 4 TB SATA HDDs).
NVMe SSD Acceleration drive of general-purpose storage: 1 x 3.2 TB ES3600P V5 NVMe SSD. Data drives of high-performance storage: 12 x 3.2 TB ES3600P V5 NVMe SSDs.
RAID controller card Avago SAS 3508
Software Versions
Table 4-4 lists the required software versions.
Table 4-4 Software versions
Software Version
OS CentOS Linux release 7.6.1810
openEuler 20.03 LTS SP1
Ceph 14.2.x Nautilus
ceph-deploy 2.0.1
Vdbench 5.04.06
Node Information
Table 4-5 describes the IP network segment planning of the hosts.
Table 4-5 Node information
Host Type      Host Name   Public Network Segment   Cluster Network Segment
OSD/MON node   Node 1      192.168.3.0/24           192.168.4.0/24
OSD/MGR node   Node 2      192.168.3.0/24           192.168.4.0/24
OSD/MDS node   Node 3      192.168.3.0/24           192.168.4.0/24
Component Deployment
Table 4-6 describes the deployment of service components in the Ceph block device cluster.
Table 4-6 Component deployment
Physical Machine Name   OSD   MON   MGR   MDS
Node 1                  12    1     1     1
Node 2                  12    1     1     1
Node 3                  12    1     1     1
Cluster Check
Run the ceph health command to check the cluster health status. If HEALTH_OK is displayed, the cluster is running properly.
4.1.3 Tuning Guidelines and Process Flow
Tuning Guidelines
Performance optimization must comply with the following principles:
● When analyzing the performance, analyze the system resource bottlenecks from multiple aspects. For example, insufficient memory capacity may cause the CPU to be occupied by memory scheduling tasks and the CPU usage to reach 100%.
● Adjust only one performance parameter at a time.
● The analysis tool may consume system resources and aggravate certain system resource bottlenecks. Therefore, the impact on applications must be avoided or minimized.
Tuning Process Flow
The tuning analysis flow is as follows:
1. In many cases, pressure test traffic is not completely sent to the backend (server). For example, a protection policy may be triggered on network access layer services such as Server Load Balancing (SLB), Web Application Firewall (WAF), High Defense IP, and even Content Delivery Network (CDN)/site acceleration in a cloud-based architecture. This occurs because the specifications, such as bandwidth, maximum number of connections, and number of new connections, are limited, or the pressure test shows the features of Challenge Collapsar (CC) and Distributed Denial of Service (DDoS) attacks. As a result, the pressure test results do not meet expectations.
2. Check whether the key indicators meet the requirements. If not, locate the fault. The fault may be caused by the servers (in most cases) or the clients (in a few cases).
3. If the problem is caused by the servers, focus on the hardware indicators such as the CPU, memory, drive I/O, and network I/O. Locate the fault and perform further analysis on the abnormal hardware indicator.
4. If all hardware indicators are normal, check the middleware indicators such as the thread pool, connection pool, and GC indicators. Perform further analysis based on the abnormal middleware indicator.
5. If all middleware indicators are normal, check the database indicators such as the slow query SQL indicators, hit ratio, locks, and parameter settings.
6. If the preceding indicators are normal, the algorithm, buffer, cache, synchronization, or asynchronization of the applications may be faulty. Perform further analysis.
Table 4-7 lists the possible bottlenecks.
Table 4-7 Possible bottlenecks
Hardware/Specifications: Problems of the CPU, memory, and drive I/O. The problems are classified into server hardware bottlenecks and network bottlenecks (network bottlenecks can be ignored in a LAN).
Middleware: Problems of software such as application servers, web servers, and database systems. For example, a bottleneck may be caused if parameters of the Java Database Connectivity (JDBC) connection pool configured on the WebLogic platform are set improperly.
Applications: Problems related to applications developed by developers. For example, when the system receives a large number of user requests, the following problems may cause low system performance: slow SQL statements and improper Java Virtual Machine (JVM) parameters, container settings, database design, program architecture planning, and program design (insufficient threads for serial processing and request processing, no buffer, no cache, and mismatch between producers and consumers).
OS: Problems related to the OS such as Windows, UNIX, or Linux. For example, if the physical memory capacity is insufficient and the virtual memory capacity is improper during a performance test, the virtual memory swap efficiency may be greatly reduced. As a result, the response time is increased. This bottleneck is caused by the OS.
Network devices: Problems related to devices such as firewalls, dynamic load balancers, and switches. Currently, more network access products are used in the cloud service architecture, including but not limited to the SLB, WAF, High Defense IP, CDN, and site acceleration. For example, if a dynamic load distribution mechanism is set on the dynamic load balancer, the dynamic load balancer automatically sends subsequent transaction requests to low-load servers when the hardware resource usage of a server reaches the limit. If the dynamic load balancer does not function as expected in the test, the problem is a network bottleneck.
General tuning procedure:
Figure 4-3 shows the general tuning procedure.
Figure 4-3 General tuning procedure
4.2 General-Purpose Storage
4.2.1 Hardware Tuning
NVMe SSD Tuning
● Purpose
Reduce cross-chip data overheads.
● Procedure
Install the NVMe SSDs and NIC into the same riser card.
DIMM Installation Mode Tuning
● Purpose
Populate one DIMM per channel (1DPC) to maximize the memory performance. That is, populate the DIMM 0 slot of each channel.
● Procedure
Preferentially populate DIMM 0 slots (DIMM 000, 010, 020, 030, 040, 050, 100, 110, 120, 130, 140, and 150). Of the three digits in the DIMM slot number, the first digit indicates the CPU, the second digit indicates the DIMM channel, and the third digit indicates the DIMM. Populate the DIMM slots whose third digit is 0 in ascending order.
4.2.2 System Tuning
Optimizing the OS Configuration
● Purpose
Adjust the system configuration to maximize the hardware performance.
● Procedure
Table 4-8 lists the optimization items.
Table 4-8 OS configuration parameters
Parameter: vm.swappiness
  Description: The swap partition is the virtual memory of the system. Do not use the swap partition because it will deteriorate system performance.
  Suggestion: Default value: 60. The performance deteriorates significantly when the swap partition is used. Disable the swap partition and set this parameter to 0.
  Configuration method: Run the following command:
  sudo sysctl vm.swappiness=0
Parameter: MTU
  Description: Maximum size of a data packet that can pass through a NIC. After the value is increased, the number of network packets can be reduced and the efficiency can be improved.
  Suggestion: Default value: 1500 bytes. Run the ip addr command to view the value. Set the maximum size of a data packet that can pass through a NIC to 9000 bytes.
  Configuration method:
  1. Run the following command and add MTU="9000":
     vi /etc/sysconfig/network-scripts/ifcfg-${Interface}
     NOTE: ${Interface} indicates the network port name.
  2. After the configuration is complete, restart the network service:
     service network restart
Parameter: pid_max
  Description: The default value of pid_max is 32768, which is sufficient in normal cases. However, when heavy workloads are being processed, this value is insufficient and may cause memory allocation failure.
  Suggestion: Default value: 32768. Run the cat /proc/sys/kernel/pid_max command to view the value. Set the maximum number of threads that can be generated in the system to 4194303.
  Configuration method: Run the following command:
  echo 4194303 > /proc/sys/kernel/pid_max
Parameter: file_max
  Description: Maximum number of files that can be opened by all processes in the system. In addition, some programs can call the setrlimit interface to set the limit on each process. If the system generates a large number of errors indicating that file handles are used up, increase the value of this parameter.
  Suggestion: Default value: 13291808. Run the cat /proc/sys/fs/file-max command to view the value. Set the maximum number of files that can be opened by all processes in the system to the value displayed after the cat /proc/meminfo | grep MemTotal | awk '{print $2}' command is run.
  Configuration method: Run the following command:
  echo ${file-max} > /proc/sys/fs/file-max
  NOTE: ${file-max} is the value displayed after cat /proc/meminfo | grep MemTotal | awk '{print $2}' is run.
Parameter: read_ahead
  Description: Linux readahead means that the Linux kernel prefetches a certain area of the specified file and loads it into the page cache. As a result, when the area is accessed subsequently, blocking caused by page faults will not occur. Reading data from memory is much faster than reading data from drives. Therefore, the readahead feature can effectively reduce the number of drive seeks and the I/O waiting time of the applications. It is one of the important methods for optimizing the drive read I/O performance.
  Suggestion: Default value: 128 KB. Run /sbin/blockdev --getra /dev/sdb to view the value. Change the value to 8192 KB. Improve the drive read efficiency by pre-reading and recording the data to random access memory (RAM).
  Configuration method: Run the following command:
  /sbin/blockdev --setra 8192 /dev/sdb
  NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
I/O scheduler
Description: The Linux I/O scheduler is a component of the Linux kernel. You can adjust the scheduler to optimize system performance.
Default value: CFQ
Symptom: The Linux I/O scheduler needs to be configured based on the storage device type for optimal system performance.
Suggestion: Set the I/O scheduling policy to deadline for HDDs and noop for SSDs.
Configuration: Run the following command:
echo deadline > /sys/block/sdb/queue/scheduler
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
nr_requests
Description: If the Linux system receives a large number of read requests, the default request queue depth may be insufficient. To deal with this problem, you can dynamically adjust the default queue depth in the /sys/block/<device>/queue/nr_requests file.
Default value: 128
Symptom: Increase the drive throughput by adjusting the nr_requests parameter.
Suggestion: Set the drive request queue depth to 512.
Configuration: Run the following command:
echo 512 > /sys/block/sdb/queue/nr_requests
NOTE: /dev/sdb is used as an example. You need to modify this parameter for all data drives.
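The kernel and per-drive settings above can be applied together. The sketch below is a hypothetical helper, not part of the guide: it writes to a scratch directory so it can be previewed safely, and the drive list is an example. To apply the values on a real storage node, write to the real /proc and /sys paths as root (readahead additionally requires /sbin/blockdev --setra 8192 /dev/<drive> per drive).

```shell
# Sketch: apply the table's kernel and per-drive settings in one pass.
# TARGET defaults to a scratch directory for a safe dry run; on a real
# node, write to the actual /proc and /sys paths as root instead.
TARGET=$(mktemp -d)
DRIVES="sdb"                      # example; list all data drives here
mkdir -p "$TARGET/proc/sys/kernel" "$TARGET/proc/sys/fs"
echo 4194303 > "$TARGET/proc/sys/kernel/pid_max"
# file-max suggestion: MemTotal in kB, as the table recommends.
grep MemTotal /proc/meminfo | awk '{print $2}' > "$TARGET/proc/sys/fs/file-max"
for d in $DRIVES; do
    q="$TARGET/sys/block/$d/queue"
    mkdir -p "$q"
    echo deadline > "$q/scheduler"    # use noop instead for SSDs
    echo 512 > "$q/nr_requests"
done
cat "$TARGET/proc/sys/kernel/pid_max"
```

Running the dry run first makes it easy to review every value before touching the live kernel knobs.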
Optimizing the Network Performance
● Purpose
This section uses the four-port 25GE SFP+ Ethernet adapter (Hi1822) as an example to describe how to tune the NIC parameters for optimal performance.
● Procedure
The optimization methods include adjusting NIC parameters and interrupt-core binding (binding interrupts to the physical CPU of the NIC). Table 4-9 describes the optimization items.
Table 4-9 NIC parameters
Parameter | Description | Suggestion
irqbalance
Description: System interrupt balancing service, which automatically allocates NIC software interrupts to idle CPUs.
Default value: active
Symptom: When this service is enabled, the system automatically allocates NIC software interrupts to idle CPUs.
Suggestion:
● Disable irqbalance:
systemctl stop irqbalance
● Keep the service disabled after the server is restarted:
systemctl disable irqbalance
rx_buff
Description: Aggregation of large network packets requires multiple discontinuous memory pages and causes low memory usage. You can increase the value of this parameter to improve memory usage.
Default value: 2
Symptom: With the default value 2, interrupts consume a large number of CPU resources.
Suggestion: Load the driver with rx_buff set to 8 to reduce discontinuous memory and improve memory usage and performance. For details, see the description following the table.
ring_buffer
Description: You can increase the throughput by adjusting the NIC buffer size.
Default value: 1024
Symptom: Run the ethtool -g <NIC name> command to view the value.
Suggestion: Change the ring_buffer queue size to 4096. For details, see the description following the table.
lro
Description: lro indicates large receive offload. After this function is enabled, multiple small packets are aggregated into one large packet for better efficiency.
Default value: off
Symptom: After this function is enabled, the maximum throughput increases significantly.
Suggestion: Enable the large-receive-offload function to improve the efficiency of sending and receiving packets. For details, see the description following the table.
hinicadm lro -i hinic0 -t <NUM>
Description: Received aggregated packets are sent after the time specified by <NUM> (in microseconds). You can set the value to 256 microseconds for better efficiency.
Default value: 16 microseconds
Symptom: This parameter is used together with the LRO function.
Suggestion: Change the value to 256 microseconds.
hinicadm lro -i hinic0 -n <NUM>
Description: Received aggregated packets are sent after the number of aggregated packets reaches the value specified by <NUM>. You can set the value to 32 for better efficiency.
Default value: 4
Symptom: This parameter is used together with the LRO function.
Suggestion: Change the value to 32.
– Adjusting rx_buff
i. Go to the /etc/modprobe.d directory.
cd /etc/modprobe.d
ii. Create the hinic.conf file.
vi /etc/modprobe.d/hinic.conf
Add the following line to the file:
options hinic rx_buff=8
iii. Reload the driver.
rmmod hinic
modprobe hinic
iv. Check whether the value of rx_buff is changed to 8.
cat /sys/bus/pci/drivers/hinic/module/parameters/rx_buff
– Adjusting ring_buffer
i. Change the buffer size from the default value 1024 to 4096.
ethtool -G <NIC name> rx 4096 tx 4096
ii. Check the current buffer size.
ethtool -g <NIC name>
– Enabling LRO
i. Enable the LRO function for a NIC.
ethtool -K <NIC name> lro on
ii. Check whether the function is enabled.
ethtool -k <NIC name> | grep large-receive-offload
NOTE
In addition to optimizing the preceding parameters, you need to bind the NIC software interrupts to cores.
1. Disable the irqbalance service.
2. Query the NUMA node to which the NIC belongs:
cat /sys/class/net/<Network port name>/device/numa_node
3. Query the CPU cores that correspond to the NUMA node.
lscpu
4. Query the interrupt IDs corresponding to the NIC.
cat /proc/interrupts | grep <Network port name> | awk -F ':' '{print $1}'
5. Bind each software interrupt to a core of the NUMA node.
echo <core number> > /proc/irq/<Interrupt ID>/smp_affinity_list
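The interrupt-binding steps above can be sketched as a small helper. The sample /proc/interrupts lines, IRQ numbers, port name, and core range below are hypothetical; on a real node, feed the actual /proc/interrupts and write the core list as root.

```shell
# irq_ids: read /proc/interrupts-style text on stdin and print the IRQ
# numbers whose lines mention the given network port name.
irq_ids() { awk -F ':' -v port="$1" '$0 ~ port { gsub(/ /, "", $1); print $1 }'; }

# Hypothetical sample lines; real numbers come from /proc/interrupts.
sample=' 45:  0  0  hinic  eth0-txrx-0
 46:  0  0  hinic  eth0-txrx-1'
printf '%s\n' "$sample" | irq_ids eth0

# On a storage node (as root), bind each IRQ to cores of the NIC's NUMA node:
#   for irq in $(irq_ids eth0 < /proc/interrupts); do
#       echo 0-23 > "/proc/irq/${irq}/smp_affinity_list"
#   done
```

Extracting the IRQ IDs into a function keeps the binding loop readable and lets you preview the IDs before writing any affinity mask.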
4.2.3 Ceph Tuning
Modifying Ceph Configuration
● Purpose
Adjust the Ceph configuration to maximize system resource usage.
● Procedure
You can edit the /etc/ceph/ceph.conf file to modify all Ceph configuration parameters. For example, you can add osd_pool_default_size = 4 to the /etc/ceph/ceph.conf file to change the default number of copies to 4, and then run the systemctl restart ceph.target command to restart the Ceph daemon processes for the change to take effect.
The preceding operations take effect only on the current Ceph node. You need to modify the ceph.conf file on all Ceph nodes and restart the Ceph daemon processes for the modification to take effect on the entire Ceph cluster. Table 4-10 describes the Ceph optimization items.
Table 4-10 Ceph parameter configuration
Parameter | Description | Suggestion
[global]
cluster_network
Description: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
Recommended value: 192.168.4.0/24. You can set this parameter as required, as long as it is different from the public network segment.
public_network
Recommended value: 192.168.3.0/24. You can set this parameter as required, as long as it is different from the cluster network segment.
osd_pool_default_size
Description: Number of copies
Recommended value: 3
osd_memory_target
Description: Size of memory that each OSD process is allowed to obtain
Recommended value: 4294967296
For details about how to optimize other parameters, see Table 4-11.
Table 4-11 Other parameter configuration
Parameter | Description | Suggestion
[global]
osd_pool_default_min_size
Description: Minimum number of I/O copies that a PG can receive. If a PG is in the degraded state, its I/O capability is not affected.
Default value: 0. Recommended value: 1
cluster_network
Description: You can configure a network segment different from the public network for OSD replication and data balancing to relieve the pressure on the public network.
This parameter has no default value. Recommended value: 192.168.4.0/24
osd_memory_target
Description: Size of memory that each OSD process is allowed to obtain
Default value: 4294967296. Recommended value: 4294967296
[mon]
mon_clock_drift_allowed
Description: Clock drift allowed between MONs
Default value: 0.05. Recommended value: 1
mon_osd_min_down_reporters
Description: Minimum number of down OSD reports that triggers a report to the MONs
Default value: 2. Recommended value: 13
mon_osd_down_out_interval
Description: Number of seconds that Ceph waits before an OSD is marked as down or out
Default value: 600. Recommended value: 600
[osd]
osd_journal_size
Description: OSD journal size
Default value: 5120. Recommended value: 20000
osd_max_write_size
Description: Maximum size (in MB) of data that can be written by an OSD at a time
Default value: 90. Recommended value: 512
osd_client_message_size_cap
Description: Maximum size (in bytes) of client data that can be stored in memory
Default value: 100. Recommended value: 2147483648
osd_deep_scrub_stride
Description: Number of bytes that can be read during deep scrubbing
Default value: 524288. Recommended value: 131072
osd_map_cache_size
Description: Size of the cache (in MB) that stores the OSD map
Default value: 50. Recommended value: 1024
osd_recovery_op_priority
Description: Recovery priority. The value ranges from 1 to 63. A larger value indicates higher resource usage.
Default value: 3. Recommended value: 2
osd_recovery_max_active
Description: Number of active recovery requests in the same period
Default value: 3. Recommended value: 10
osd_max_backfills
Description: Maximum number of backfills allowed by an OSD
Default value: 1. Recommended value: 4
osd_min_pg_log_entries
Description: Minimum number of reserved PG logs
Default value: 3000. Recommended value: 30000
osd_max_pg_log_entries
Description: Maximum number of reserved PG logs
Default value: 3000. Recommended value: 100000
osd_mon_heartbeat_interval
Description: Interval (in seconds) at which an OSD pings a MON
Default value: 30. Recommended value: 40
ms_dispatch_throttle_bytes
Description: Maximum number of message bytes waiting to be dispatched
Default value: 104857600. Recommended value: 1048576000
objecter_inflight_ops
Description: Maximum allowed number of in-flight I/O requests. This parameter is used for client traffic control. If the number of in-flight I/O requests exceeds the threshold, the application I/O is blocked. The value 0 indicates that the number is not limited.
Default value: 1024. Recommended value: 819200
osd_op_log_threshold
Description: Number of operation logs to be displayed at a time
Default value: 5. Recommended value: 50
osd_crush_chooseleaf_type
Description: Bucket type when the CRUSH rule uses chooseleaf
Default value: 1. Recommended value: 0
journal_max_write_bytes
Description: Maximum number of journal bytes that can be written at a time
Default value: 10485760. Recommended value: 1073714824
journal_max_write_entries
Description: Maximum number of journal entries that can be written at a time
Default value: 100. Recommended value: 10000
[client]
rbd_cache
Description: RBD cache
Default value: True. Recommended value: True
rbd_cache_size
Description: RBD cache size (in bytes)
Default value: 33554432. Recommended value: 335544320
rbd_cache_max_dirty
Description: Maximum number of dirty bytes allowed when the cache is in writeback mode. If the value is 0, the cache works in writethrough mode.
Default value: 25165824. Recommended value: 134217728
rbd_cache_max_dirty_age
Description: Duration (in seconds) for which dirty data is stored in the cache before being flushed to the drives
Default value: 1. Recommended value: 30
rbd_cache_writethrough_until_flush
Description: This parameter is used for compatibility with virtio drivers earlier than linux-2.6.32. It prevents data from being written back when no flush request has been sent. After this parameter is set, librbd processes I/Os in writethrough mode and switches to writeback mode only after the first flush request is received.
Default value: True. Recommended value: False
rbd_cache_max_dirty_object
Description: Maximum number of cached objects. The default value is 0, which indicates that the number is calculated based on the RBD cache size. By default, librbd logically splits the drive image into 4 MB chunks, and each chunk is abstracted as an object that librbd caches. You can increase the value of this parameter to improve performance.
Default value: 0. Recommended value: 2
rbd_cache_target_dirty
Description: Dirty data size that triggers writeback. The value cannot exceed the value of rbd_cache_max_dirty.
Default value: 16777216. Recommended value: 235544320
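Applying these recommendations means editing ceph.conf on every node and restarting the daemons. The sketch below is a hypothetical helper shown against a scratch copy, with two of the table's settings as examples; a real helper would also handle sections and keys that already exist. To apply for real, edit /etc/ceph/ceph.conf on every node.

```shell
# set_ceph_opt: append "key = value" to a ceph.conf-style file (sketch).
set_ceph_opt() { printf '%s = %s\n' "$2" "$3" >> "$1"; }

CONF=$(mktemp)                       # scratch copy for a safe preview
printf '[global]\n' > "$CONF"
set_ceph_opt "$CONF" osd_pool_default_min_size 1
set_ceph_opt "$CONF" mon_clock_drift_allowed 1
grep mon_clock_drift_allowed "$CONF"
# To apply for real, merge the settings into /etc/ceph/ceph.conf, then:
#   systemctl restart ceph.target
```

Previewing the generated fragment first avoids pushing a typo to every node at once.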
Optimizing the PG Distribution
● Purpose
Adjust the number of PGs on each OSD to balance the load on each OSD.
● Procedure
By default, Ceph allocates eight PGs/PGPs to each storage pool. When creating a storage pool, run the ceph osd pool create {pool-name} {pg-num} {pgp-num} command to specify the number of PGs/PGPs, or run the ceph osd pool set {pool_name} pg_num {pg-num} and ceph osd pool set {pool_name} pgp_num {pgp-num} commands to change the number of PGs/PGPs of an existing storage pool. After the modification, run the ceph osd pool get {pool_name} pg_num/pgp_num command to check the number of PGs/PGPs in the storage pool.
The default value of ceph balancer mode is none. You can run the ceph balancer mode upmap command to change it to upmap. The Ceph balancer function is disabled by default. Run the ceph balancer on or ceph balancer off command to enable or disable it.
Table 4-12 describes the PG distribution parameters.
Table 4-12 PG distribution parameters
Parameter | Description | Suggestion
pg_num
Description: Total PGs = (Total_number_of_OSDs x 100)/max_replication_count. Round up the result to the nearest integer power of 2.
Default value: 8
Symptom: A warning is displayed if the number of PGs is insufficient.
Suggestion: Calculate the value based on the formula.
pgp_num
Description: Set the number of PGPs to be the same as that of PGs.
Default value: 8
Symptom: It is recommended that the number of PGPs be the same as the number of PGs.
Suggestion: Calculate the value based on the formula.
ceph balancer mode
Description: Enable the balancer plug-in and set the plug-in mode to upmap.
Default value: none
Symptom: If the PG distribution is unbalanced, some OSDs may be overloaded and become bottlenecks.
Recommended value: upmap
NOTE
● The number of PGs carried by each OSD must be the same or close. Otherwise, some OSDs may be overloaded and become bottlenecks. The balancer plug-in can be used to optimize the PG distribution. You can run the ceph balancer eval or ceph pg dump command to view the PG distribution.
● Run the ceph balancer mode upmap and ceph balancer on commands to automatically balance and optimize Ceph PGs. Ceph adjusts the distribution of a few PGs every 60 seconds. Run the ceph balancer eval or ceph pg dump command to view the PG distribution. If the PG distribution does not change, the distribution is optimal.
● The PG distribution of each OSD affects the load balancing of write data. In addition to optimizing the number of PGs corresponding to each OSD, the distribution of the primary PGs also needs to be optimized. That is, the primary PGs need to be distributed to each OSD as evenly as possible.
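The pg_num formula from Table 4-12 can be computed directly. A small sketch; the OSD and replica counts below are example inputs, not values from the guide's test environment.

```shell
# pg_num: (OSD count x 100 / replica count), rounded up to the nearest
# integer power of 2, per the formula in Table 4-12.
pg_num() {
    local osds="$1" replicas="$2"
    local raw=$(( osds * 100 / replicas ))
    local p=1
    while [ "$p" -lt "$raw" ]; do p=$(( p * 2 )); done
    echo "$p"
}
pg_num 12 3    # example: 12 OSDs, 3 replicas -> 400 -> 512
```

The result is then passed to ceph osd pool create {pool-name} {pg-num} {pgp-num}, with pgp_num set to the same value.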
Binding OSDs to CPU Cores
● Purpose
Bind each OSD process to a fixed CPU core.
● Procedure
Add osd_numa_node = <NUM> to the /etc/ceph/ceph.conf file.
Table 4-13 describes the optimization items.
Table 4-13 OSD core binding parameters
Parameter | Description | Suggestion
[osd.n]
osd_numa_node
Description: Binds the osd.n daemon process to a specified idle NUMA node, that is, a node other than the nodes that process the NIC software interrupts.
This parameter has no default value.
Symptom: If each OSD process runs on the same CPUs as the NIC interrupts, some CPUs may be overloaded.
Suggestion: To balance the CPU load, avoid running OSD processes and NIC interrupt processing (or other processes with high CPU usage) on the same NUMA node.
NOTE
● The Ceph OSD daemon processes and NIC software interrupt processing must run on different NUMA nodes. Otherwise, CPU bottlenecks may occur when the network load is heavy. By default, Ceph evenly allocates OSD processes to all CPU cores. You can add the osd_numa_node parameter to the ceph.conf file to avoid running OSD processes and NIC interrupt processing (or other processes with high CPU usage) on the same NUMA node.
● Optimizing the Network Performance describes how to bind NIC software interrupts to the CPU cores of the NUMA node to which the NIC belongs. When the network load is heavy, the usage of the CPU cores bound to the software interrupts is high. Therefore, you are advised to set osd_numa_node to a NUMA node different from that of the NIC. For example, run the cat /sys/class/net/<Port name>/device/numa_node command to query the NUMA node of the NIC. If the NIC belongs to NUMA node 2, set osd_numa_node = 0 or osd_numa_node = 1 to prevent the OSDs and NIC software interrupts from using the same CPU cores.
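The node-selection rule in the note can be sketched as a tiny helper. The function and its inputs are hypothetical; the NIC node comes from the numa_node query above, and the total node count from lscpu.

```shell
# pick_osd_numa: print the first NUMA node that is not the NIC's node,
# following the rule in the note above.
pick_osd_numa() {
    local nic_node="$1" total_nodes="$2" n=0
    while [ "$n" -lt "$total_nodes" ]; do
        if [ "$n" -ne "$nic_node" ]; then echo "$n"; return; fi
        n=$(( n + 1 ))
    done
}
pick_osd_numa 2 4    # NIC on node 2 of 4 nodes -> prints 0
```

The printed node number becomes the osd_numa_node value in the [osd.n] section of ceph.conf.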
Optimizing Compression Algorithm Configuration Parameters
● Purpose
Adjust the compression algorithm configuration parameters to optimize the performance of the compression algorithm.
● Procedure
The default value of bluestore_min_alloc_size_hdd for Ceph is 32 KB. The value of this parameter affects the size of the final data obtained after compression. Set this parameter to a smaller value to maximize the compression rate of the compression algorithm.
By default, Ceph uses five threads to process I/O requests in an OSD process. After the compression algorithm is enabled, the number of threads can become a performance bottleneck. Increase the number of threads to maximize the performance of the compression algorithm.
The following table describes the compression-related parameters:
Parameter | Description | Suggestion
bluestore_min_alloc_size_hdd
Description: Minimum size of objects allocated to the HDD data disks in the BlueStore storage engine
Default value: 32768. Recommended value: 8192
osd_op_num_shards_hdd
Description: Number of shards for an HDD data disk in an OSD process
Default value: 5. Recommended value: 12
osd_op_num_threads_per_shard_hdd
Description: Average number of threads of an OSD process for each HDD data disk shard
Default value: 1. Recommended value: 2
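The three settings above can be collected into one ceph.conf fragment. The sketch writes to a scratch file so it can be reviewed; to apply, merge the fragment into /etc/ceph/ceph.conf on each storage node and restart the OSDs.

```shell
# Sketch: build the compression-related [osd] fragment in a scratch file,
# using the recommended values from the table above.
CONF=$(mktemp)
cat > "$CONF" <<'EOF'
[osd]
bluestore_min_alloc_size_hdd = 8192
osd_op_num_shards_hdd = 12
osd_op_num_threads_per_shard_hdd = 2
EOF
grep -c hdd "$CONF"    # each of the three settings names an _hdd knob
```

Note that bluestore_min_alloc_size_hdd takes effect only for OSDs deployed after the change, since the allocation size is fixed when the BlueStore volume is created.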
Enabling Bcache
Bcache is a block layer cache of the Linux kernel. It uses SSDs as the cache of HDDs for acceleration. To enable the Bcache kernel module, you need to recompile the kernel. For details, see the Bcache User Guide (CentOS 7.6).
Using the I/O Passthrough Tool
The I/O passthrough tool is a process optimization tool for balanced scenarios of the Ceph cluster. It can automatically detect and optimize OSDs in the Ceph cluster. For details on how to use this tool, see the I/O Passthrough Tool User Guide.
4.2.4 KAE zlib Compression Tuning
● Purpose
Optimize zlib compression to free up the CPU capability for processing OSDs and maximize the hardware performance.
● Procedure
zlib compression is processed by the KAE.
Preparing the Environment
NOTE
Before installing the accelerator engine, you need to apply for and install a license.
License application guide: https://support.huawei.com/enterprise/zh/doc/EDOC1100068122/b9878159
Installation guide: https://support.huawei.com/enterprise/en/doc/EDOC1100048786/ba20dd15
Download the acceleration engine installation package and developer guide.
Download link: https://github.com/kunpengcompute/KAE/tags
Installing the Acceleration Engine
NOTE
The developer guide describes how to install and use all modules of the accelerator engine. Select an appropriate installation mode based on the developer guide. For details, see Installing the KAE Software Package Using Source Code.
Step 1 Install the acceleration engine according to the developer guide.
Step 2 Install the zlib library.
1. Download KAEzip.
2. Download zlib-1.2.11.tar.gz from the zlib official website and copy it to KAEzip/open_source.
3. Perform the compilation and installation.
cd KAEzip
sh setup.sh install
The zlib library is installed in /usr/local/kaezip.
Step 3 Back up the existing library link.
mv /lib64/libz.so.1 /lib64/libz.so.1-bak
Step 4 Replace the zlib software compression dynamic library with the KAE version.
cd /usr/local/kaezip/lib
cp libz.so.1.2.11 /lib64/
ln -s /lib64/libz.so.1.2.11 /lib64/libz.so.1
NOTE
/usr/local/kaezip is the zlib installation path used in this example. Change it as required.
----End
NOTE
If the Ceph cluster is running before the dynamic library is replaced, run the following command on all storage nodes to restart the OSDs for the change to take effect:
systemctl restart ceph-osd.target
Changing the Default Number of Accelerator Queues
NOTE
The default number of queues of the hardware accelerator is 256. To fully utilize the performance of the accelerator, change the number of queues to 512 or 1024.
Step 1 Remove the hisi_zip module.
rmmod hisi_zip
Step 2 Set the default accelerator queue parameter pf_q_num=512.
vi /etc/modprobe.d/hisi_zip.conf
options hisi_zip uacce_mode=2 pf_q_num=512
Step 3 Load the hisi_zip module.
modprobe hisi_zip
Step 4 Check the hardware accelerator queues.
cat /sys/class/uacce/hisi_zip-*/attrs/available_instances
The change is successful if the output shows the new queue count (512).
Step 5 Check the dynamic library links. If libwd.so.1 is contained in the command output, the operation is successful.
ldd /lib64/libz.so.1
----End
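The checks in steps 4 and 5 can be combined into one verification helper. A sketch; the uacce paths exist only on a machine with the accelerator, and the helper prints a diagnostic rather than failing when they are absent.

```shell
# check_kae_zlib: report accelerator queue counts and whether libz is
# linked against the KAE userspace library (libwd).
check_kae_zlib() {
    for f in /sys/class/uacce/hisi_zip-*/attrs/available_instances; do
        # The glob matches only on hosts with the accelerator present.
        [ -r "$f" ] && echo "queues($f): $(cat "$f")"
    done
    if ldd /lib64/libz.so.1 2>/dev/null | grep -q 'libwd\.so'; then
        echo "libz linked against libwd: yes"
    else
        echo "libz linked against libwd: no"
    fi
}
check_kae_zlib
```

Running this after each driver reload gives a quick pass/fail view of both the queue change and the library replacement.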
Adapting Ceph to the Accelerator
NOTE
Currently, the mainline Ceph versions allow configuring the zlib compression mode in the configuration file. The released Ceph versions (up to v15.2.3) adopt the zlib compression mode without the data header and tail. However, the current hardware acceleration library supports only the mode with the data header and tail. Therefore, the Ceph source code needs to be modified to adapt to the Kunpeng hardware acceleration library. For details about the modification method, see the latest patch that has been merged into the mainline version:
https://github.com/ceph/ceph/pull/34852
The following uses Ceph 14.2.11 as an example to describe how to adapt Ceph to the zlib compression engine.
Step 1 Obtain the source code.
Source code download address: https://download.ceph.com/tarballs/
After the source code package is downloaded, save it to the /home directory on the server.
Step 2 Obtain the patch and save it to the /home directory.
https://github.com/kunpengcompute/ceph/releases/download/v14.2.11/ceph-14.2.11-glz.patch
Step 3 Go to the /home directory, decompress the source code package, and enter the directory generated after decompression.
cd /home && tar -zxvf ceph-14.2.11.tar.gz && cd ceph-14.2.11/
Step 4 Apply the patch in the root directory of the source code.
cd /home/ceph-14.2.11
patch -p1 < ceph-14.2.11-glz.patch
Step 5 After modifying the source code, compile Ceph.
● CentOS: See the Ceph 14.2.1 Porting Guide (CentOS 7.6).
● openEuler: See the Ceph 14.2.8 Porting Guide (openEuler 20.03).
Step 6 Install Ceph.
Step 7 Modify the ceph.conf file to configure the zlib compression window size.
vi /etc/ceph/ceph.conf
compressor_zlib_winsize=15
Step 8 Restart the Ceph cluster for the configuration to take effect, and then verify the setting:
ceph daemon osd.0 config show | grep compressor_zlib_winsize
----End
A Change History
Date Description
2021-09-13 This issue is the tenth official release. Added 1 Using the Kunpeng Hyper Tuner for Tuning.
2021-07-14 This issue is the ninth official release. Added the adaptation of Ceph storage tuning to openEuler 20.03 LTS SP1.
2021-06-25 This issue is the eighth official release. Changed the processor model from "Kunpeng 920 5230" to "Kunpeng 920 5220."
2021-05-26 This issue is the seventh official release.
● Changed zlib hardware acceleration to KAE zlib compression.
● Changed MD5 hardware acceleration to the KAE MD5 digest algorithm.
2021-03-23 This issue is the sixth official release. Changed the solution name from "Kunpeng SDS solution" to "Kunpeng BoostKit for SDS."
2021-01-19 This issue is the fifth official release.
● Modified the Adapting Ceph to the Accelerator operation procedure in the 3 Ceph Object Storage Tuning Guide.
● Added the reference to the I/O Passthrough Tool User Guide to the I/O passthrough tool tuning.
2020-09-27 This issue is the fourth official release. Added information about the I/O passthrough tool.
2020-06-29 This issue is the third official release.
● Modified the Bcache-related reference.
● Added 3.4.3 KAE MD5 Digest Algorithm Tuning in the 3 Ceph Object Storage Tuning Guide.
2020-05-09 This issue is the second official release. Modified figures in the documents.
2020-03-20 This issue is the first official release.