Email Storage with Ceph @ SUSE Enterprise Storage
Danny Al-Gaaf, Senior Cloud Technologist, Deutsche Telekom AG
[email protected]
Agenda
• Telekom Mail platform
• Motivation
• Ceph
• librmb and dovecot
• Hardware
• Placement
• PoC and Next Steps
• Conclusion
Telekom Mail
• DT's mail platform for customers
• dovecot
• Network-Attached Storage (NAS)
• NFS (sharded)
• ~1.3 petabyte net storage
• ~39 million accounts
NFS Statistics
~42% usable raw space

NFS IOPS
• max: ~835,000
• avg: ~390,000
• relevant IO (max / avg):
– WRITE: 107,700 / 50,000
– READ: 65,700 / 30,900
Email Statistics
6.7 billion emails
• 1.2 petabyte net
• compression
1.2 billion index/cache/metadata files
• avg: 24 KiB
• max: ~600 MiB
How are emails stored?
Emails are written once, read many (WORM)
Usage depends on:
• protocol (IMAP vs POP3)
• user frontend (mailer vs webmailer)
Usually separated metadata, caches and indexes
• loss of metadata/indexes is critical
Without attachments, easy to compress
Where to store emails?
Filesystem
• maildir
• mailbox
Database
• SQL
Object store
• S3
• Swift
Motivation
• scale-out vs scale-up
• fast self-healing
• commodity hardware
• prevent vendor lock-in
• open source where feasible
• reduce Total Cost of Ownership (TCO)
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
LIBRADOS
A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP
RADOSGW
A bucket-based REST gateway, compatible with S3 and Swift
RBD
A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
CEPH FS
A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
Ceph Options
Filesystem
• CephFS
• NFS Gateway via RGW
• any filesystem on RBD
Object store
• S3/Swift via RGW
• RADOS
Where to store in Ceph? CephFS
• same issues as NFS
• mail storage on POSIX layer adds complexity
• no option for emails
• usable for metadata/caches/indexes
Security
• requires direct access to storage network
• only for dedicated platform
[Diagram: Linux host with kernel module accessing metadata and data in the RADOS cluster]
Where to store in Ceph? RBD
• requires sharding and large RBDs
• requires RBD/fs extend scenarios
• regular account migration
• no sharing between clients
• impracticable
Security
• no direct access to storage network required
• secure through hypervisor abstraction (libvirt)
[Diagram: VM on a hypervisor using librbd to access the RADOS cluster]
Where to store in Ceph? RadosGW
• can store emails as objects
• extra network hops
• potential bottleneck
• very likely not fast enough
Security
• no direct access to Ceph storage network required
• connection to RadosGW can be secured (WAF)
[Diagram: application talking REST to RadosGW, which reaches the RADOS cluster through librados via socket]
Where to store in Ceph? Librados
• direct access to RADOS
• parallel I/O
• not optimized for emails
• how to handle metadata/caches/indexes?
Security
• requires direct access to storage network
• only for dedicated platform
[Diagram: applications linking librados directly against the RADOS cluster]
Dovecot
Open source project (LGPL 2.1, MIT)
72% market share (openemailsurvey.org, 02/2017)
Object store plugin available (obox)
• supports only REST APIs like S3/Swift
• not open source
• requires Dovecot Pro
• large impact on TCO
Dovecot Pro obox Plugin
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot obox backend (metacache, RFC 5322 mails) → fs API → fs cache backend and object store backend; the object store holds RFC 5322 objects and index & cache bundles; local storage holds the mail cache and the local index & cache, which are synced with the object store]
DT's approach
• no open source solution on the market
• closed source is no option
• develop/sponsor a solution
• open source it
• partner with:
– Wido den Hollander (42on.com) for consulting
– Tallence AG for development
– SUSE for Ceph
Ceph plugin for Dovecot
First step: hybrid approach
Emails
• store in RADOS cluster
Metadata and indexes
• store in CephFS
Be as generic as possible
• split out code into libraries
• integrate into corresponding upstream projects
[Diagram: Mail User Agent → IMAP/POP → Dovecot with rbox storage plugin; mails go through librmb/librados, metadata through CephFS via the Linux kernel; both paths end in the RADOS cluster]
Librados mailbox (librmb)
Generic email abstraction on top of librados
Out of scope:
• user data and credential storage
– target are huge installations where solutions are usually already in place
• full text indexes
– there are solutions already available and working outside email storage
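The kind of abstraction librmb aims for can be sketched as a small storage interface: a mail is an opaque RFC 5322 object plus an attribute map, and the backend behind it is interchangeable. This is an illustrative sketch only, assuming hypothetical names (`MailObject`, `MailStorage`); an in-memory dict stands in for a librados ioctx, and none of this is librmb's actual API.

```python
import uuid


class MailObject:
    """One RFC 5322 mail: immutable content plus immutable attributes."""

    def __init__(self, content: bytes):
        self.oid = uuid.uuid4().hex                    # object id (mail GUID)
        self.content = content                         # full RFC 5322 message
        self.xattrs = {"phy_size": str(len(content))}  # attribute map


class MailStorage:
    """Minimal mail store; a dict stands in for a RADOS pool/ioctx."""

    def __init__(self):
        self._objects = {}

    def save(self, mail: MailObject) -> str:
        # with librados this would be a write_full() plus setxattr() calls
        self._objects[mail.oid] = mail
        return mail.oid

    def load(self, oid: str) -> MailObject:
        # with librados: read() plus getxattrs()
        return self._objects[oid]

    def delete(self, oid: str) -> None:
        del self._objects[oid]


store = MailStorage()
oid = store.save(MailObject(b"Subject: test\r\n\r\nHello"))
print(store.load(oid).xattrs["phy_size"])  # "22"
```

Keeping the interface this narrow is what allows the same library to serve Dovecot's rbox backend and non-Dovecot consumers alike.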
Librados mailbox (librmb)
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend → librmb → librados → RADOS cluster (RFC 5322 objects, index & metadata); dovecot lib-index writes to CephFS via the Linux kernel; local storage holds an RFC 5322 mail cache]
librmb - Mail Object Format
Mails are immutable regarding the RFC 5322 content
• RFC 5322 content stored in RADOS directly
• immutable attributes used by Dovecot stored in RADOS xattrs:
– rbox format version
– GUID
– received and saved date
– POP3 UIDL and POP3 order
– mailbox GUID
– physical and virtual size
– mail UID
• writable attributes are stored in Dovecot index files
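As a rough illustration of the attribute set above, the sketch below builds such an xattr dictionary using the one-letter keys the rmb tool prints (U, R, S, Z, V, M, G, I). The function name, the string encoding and the omission of the POP3 fields are my own simplifications, not librmb code.

```python
from datetime import datetime, timezone


def rbox_xattrs(mail_uid: int, mailbox_guid: str, mail_guid: str,
                content: bytes, received: datetime, saved: datetime) -> dict:
    """Immutable per-mail attributes, as they could be set as RADOS xattrs."""
    return {
        "I": "0.1",                            # rbox format version
        "U": str(mail_uid),                    # mail UID
        "M": mailbox_guid,                     # mailbox GUID
        "G": mail_guid,                        # mail GUID
        "R": str(int(received.timestamp())),   # receive date
        "S": str(int(saved.timestamp())),      # save date
        "Z": str(len(content)),                # physical size
        "V": str(len(content)),                # virtual size (no CRLF fixup here)
    }


attrs = rbox_xattrs(4, "ad54230e65b49a59381100009c60b9f7",
                    "a3d69f2868b49a596a1d00009c60b9f7",
                    b"x" * 2919,
                    datetime(2003, 1, 14, tzinfo=timezone.utc),
                    datetime(2017, 8, 21, tzinfo=timezone.utc))
print(attrs["Z"])  # "2919"
```

Because all of these values are write-once, they can live next to the mail object itself, while anything mutable (flags, keywords) stays in the Dovecot index files.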
Dump email details from RADOS
$> rmb -p mail_storage -N t1 ls M=ad54230e65b49a59381100009c60b9f7
mailbox_count: 1
MAILBOX: M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7, mail_total=2, mails_displayed=2, mailbox_size=5539 bytes
MAIL: U(uid)=4, oid=a2d69f2868b49a596a1d00009c60b9f7, R(receive_time)=Tue Jan 14 00:18:11 2003, S(save_time)=Mon Aug 21 12:22:32 2017, Z(phy_size)=2919, V(v_size)=2919, stat_size=2919, M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7, G(mail_guid)=a3d69f2868b49a596a1d00009c60b9f7, I(rbox_version): 0.1 [..]
RADOS Dictionary Plugin
Makes use of the Ceph omap key/value store
RADOS namespaces:
• shared/<key>
• priv/<key>
Used by Dovecot to store metadata, quota, ...
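The shared/private split above can be illustrated with a small mapping function. The function name and the exact namespace scheme are assumptions for illustration, not the plugin's real API; the key prefixes themselves (`priv/`, `shared/`) are Dovecot's dict convention.

```python
def rados_dict_location(dovecot_key: str, username: str):
    """Map a Dovecot dict key to a (rados_namespace, omap_key) pair.

    Dovecot dict keys start with "priv/" (per-user data such as quota)
    or "shared/" (data visible to all users); the plugin can keep that
    split by using RADOS namespaces on the dictionary object.
    """
    if dovecot_key.startswith("priv/"):
        return (username, dovecot_key)   # per-user namespace
    if dovecot_key.startswith("shared/"):
        return ("", dovecot_key)         # default/shared namespace
    raise ValueError("Dovecot dict keys must start with priv/ or shared/")


print(rados_dict_location("priv/quota/storage", "u12345"))
# ('u12345', 'priv/quota/storage')
```

Storing these entries as omap key/values keeps them small and atomic without creating one RADOS object per key.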
It's open source!
License: LGPLv2.1
Language: C++
Location: github.com/ceph-dovecot/
Supported Dovecot versions:
• 2.2 (>= 2.2.21)
• 2.3
Ceph Requirements
Performance
• write performance for emails is critical
• metadata/index read/write performance
Cost
• Erasure Coding (EC) for emails
• replication for CephFS
Reliability
• MUST survive failure of disk, server, rack and even fire compartments
Which Ceph Release?
Required features:
• BlueStore
– should be at least 2x faster than filestore
• CephFS
– stable release
– Multi-MDS
• Erasure coding
Hardware
Commodity x86_64 server
• HPE ProLiant DL380 Gen9
– dual socket, Intel® Xeon® E5 V4
• 2x Intel® X710-DA2 dual-port 10G
• 2x boot SSDs, SATA, HBA
• HBA, no separate RAID controller
• CephFS, RADOS, MDS and MON nodes
Storage Nodes
CephFS SSD Nodes
• CPU: [email protected], 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC
• SSD: 8x 1.6 TB SSD, 3 DWPD, SAS, RR/RW (4k Q16) 125k/93k IOPS
Rados HDD Nodes
• CPU: [email protected], 10 cores, turbo 3.4 GHz
• RAM: 128 GByte, DDR4, ECC
• SSD: 2x 400 GByte, 3 DWPD, SAS, RR/RW (4k Q16) 108k/49k IOPS
– for BlueStore database etc.
• HDD: 10x 4 TByte, 7.2K, 128 MB cache, SAS, PMR
Compute Nodes
MDS
• CPU: [email protected], 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC
MON / SUSE admin
• CPU: [email protected], 10 cores, turbo 3.4 GHz
• RAM: 64 GByte, DDR4, ECC
Why this specific HW?
Community recommendations:
• OSD: 1x 64-bit AMD-64, 1 GB RAM / 1 TB of storage, 2x 1 GBit NICs
• MDS: 1x 64-bit AMD-64 quad-core, 1 GB RAM minimum per MDS, 2x 1 GBit NICs
NUMA, high clocked CPUs and large RAM overkill?
• vendor did not offer single socket nodes for the number of drives
• MDS performance is mostly CPU clock bound and partly single threaded
– high clocked CPUs for fast single threaded performance
• large RAM: better caching!
Issues
Datacenter
• usually two independent fire compartments (FCs)
• possibly additional virtual FCs
Requirements
• loss of customer data MUST be prevented
• any server, switch or rack can fail
• one FC can fail
• data replication at least 3 times (or equivalent)
Issues
Questions
• How to place 3 copies in two FCs?
• How independent and reliable are the virtual FCs?
• Network architecture?
• Network bandwidth?
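One common answer to the first question is a CRUSH rule that selects both fire compartments and two hosts in each, so a pool with size=3 ends up with two copies in one FC and one in the other. The following is only a sketch of that pattern, not a rule from the talk; the `fire_compartment` bucket type and the rule name are assumptions about the local CRUSH map.

```
rule mail_3_copies_2_fcs {
    ruleset 1
    type replicated
    min_size 3
    max_size 3
    step take default
    # select both fire compartments ...
    step choose firstn 2 type fire_compartment
    # ... and two different hosts in each; a size=3 pool
    # uses the first three of the four OSDs returned
    step chooseleaf firstn 2 type host
    step emit
}
```

The remaining caveat is exactly the second question above: losing the FC that holds two copies leaves only one replica until recovery completes.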
[Rack layout diagram]
• Fire Compartment A: switches, MON, MDS, 6x HDD node, 3x SSD node
• Fire Compartment B: switches, MON, MDS, 6x HDD node, 3x SSD node, SUSE Admin
• Fire Compartment C (third room): switches, MON, MDS, 3x SSD node
Network
1G OAM network
10G network
• 2 NICs / 4 ports per node / SFP+ DAC
Multi-chassis Link Aggregation (MC-LAG / M-LAG)
• for aggregation and fail-over
Spine-leaf architecture
• interconnect must not reflect theoretical rack/FC bandwidth
• L2: terminated in rack
• L3: TOR <-> spine / spine <-> spine
• Border Gateway Protocol (BGP)
[Network diagram: racks in FC1, FC2 and the vFC, each L2-terminated with N*10G SFP+ MC-LAG uplinks; 40G QSFP LAG (L3, BGP) to the spine switches; 2x 40G QSFP L3 crosslink between spines; DC-R]
Status
Dovecot Ceph Plugin
• open sourced
– initial SLES12-SP3, openSUSE Leap 42.3, and Tumbleweed RPMs
• still under development
– still includes librmb
• planned: move to Ceph project
https://github.com/ceph-dovecot/
https://build.opensuse.org/package/show/home:dalgaaf:dovecot-ceph-plugin
Testing
Functional testing
• setup small 5-node cluster
– SLES12-SP3 GMC
– SES5 Beta
• run Dovecot functional tests against Ceph
Proof-of-Concept
Hardware
• 9 SSD nodes for CephFS
• 12 HDD nodes
• 3 MDS / 3 MON
• 2 FCs + 1 vFC
Testing
• run load tests
• run failure scenarios against Ceph
• improve and tune Ceph setup
• verify and optimize hardware
Further Development
Goal: pure RADOS backend, store metadata/index in Ceph omap
[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend → librmb → librados → RADOS cluster holding RFC 5322 objects and index; local storage keeps only an RFC 5322 mail cache]
Next Steps
Production
• verify if all requirements are fulfilled
• integrate in production
• migrate users step-by-step
• extend to final size:
– 128 HDD nodes, 1200 OSDs, 4.7 PiB
– 15 SSD nodes, 120 OSDs, 175 TiB
Summary and conclusions
Ceph can replace NFS
• mails in RADOS
• metadata/indexes in CephFS
• BlueStore, EC
librmb and dovecot rbox
• Open Source, LGPLv2.1, no license costs
• librmb can be used in non-dovecot systems
• still under development
PoC with dovecot in progress
Performance optimization
Be invited to:
Participate! Try it, test it, give feedback and report bugs!
Contribute! github.com/ceph-dovecot/
Thank you.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.