Email Storage with Ceph @ SUSE Enterprise Storage
Danny Al-Gaaf, Senior Cloud Technologist, Deutsche Telekom AG
danny.al-gaaf@telekom.de

Email Storage with Ceph - SUSECON2017

Jan 21, 2018

Transcript
Page 1: Email Storage with Ceph - SUSECON2017

Email Storage with Ceph @ SUSE Enterprise Storage

Danny Al-Gaaf, Senior Cloud Technologist, Deutsche Telekom AG, danny.al-gaaf@telekom.de

Page 2: Email Storage with Ceph - SUSECON2017

Agenda

• Telekom Mail platform
• Motivation
• Ceph
• librmb and dovecot
• Hardware
• Placement
• PoC and Next Steps
• Conclusion

Page 3: Email Storage with Ceph - SUSECON2017

Telekom Mail platform

Page 4: Email Storage with Ceph - SUSECON2017

Telekom Mail

• DT's mail platform for customers
• dovecot
• Network-Attached Storage (NAS)
• NFS (sharded)
• ~1.3 petabyte net storage
• ~39 million accounts

Page 5: Email Storage with Ceph - SUSECON2017

NFS Operations

Page 7: Email Storage with Ceph - SUSECON2017

NFS Statistics

~42% usable raw space

NFS IOPS

• max: ~835,000
• avg: ~390,000

relevant IO:

• WRITE: 107,700 / 50,000
• READ: 65,700 / 30,900

Page 8: Email Storage with Ceph - SUSECON2017

Email Statistics

6.7 billion emails

• 1.2 petabyte net
• compression

1.2 billion index/cache/metadata files

• avg: 24 kiB
• max: ~600 MiB

Page 9: Email Storage with Ceph - SUSECON2017

Email Distribution

Page 10: Email Storage with Ceph - SUSECON2017

How are emails stored?

Emails are written once, read many (WORM)

Usage depends on:

• protocol (IMAP vs POP3)
• user frontend (mailer vs webmailer)

usually separated metadata, caches and indexes

• loss of metadata/indexes is critical

without attachments easy to compress

Page 11: Email Storage with Ceph - SUSECON2017

Where to store emails?

Filesystem

• maildir
• mailbox

Database

• SQL

Object store

• S3
• Swift

Page 13: Email Storage with Ceph - SUSECON2017

Motivation

• scale-out vs scale-up
• fast self healing
• commodity hardware
• prevent vendor lock-in
• open source where feasible
• reduce Total Cost of Ownership (TCO)

Page 14: Email Storage with Ceph - SUSECON2017

RADOS

A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

LIBRADOS

A library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby and PHP

RADOSGW

A bucket-based REST gateway, compatible with S3 and Swift

RBD

A reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver

CEPH FS

A POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE

Page 15: Email Storage with Ceph - SUSECON2017

Ceph Options

Filesystem

• CephFS
• NFS Gateway via RGW
• any filesystem on RBD

Object store

• S3/Swift via RGW
• RADOS

Page 16: Email Storage with Ceph - SUSECON2017

Where to store in Ceph? CephFS

• same issues as NFS
• mail storage on POSIX layer adds complexity
• no option for emails
• usable for metadata/caches/indexes

Security

• requires direct access to storage network
• only for dedicated platform

[Diagram: Linux host with kernel module (CephFS client) accessing the RADOS cluster for metadata and data]

Page 17: Email Storage with Ceph - SUSECON2017

Where to store in Ceph? RBD

• requires sharding and large RBDs
• requires RBD/fs extend scenarios
• regular account migration
• no sharing between clients
• impracticable

Security

• no direct access to storage network required
• secure through hypervisor abstraction (libvirt)

[Diagram: VM on a hypervisor using librbd to access the RADOS cluster]

Page 18: Email Storage with Ceph - SUSECON2017

Where to store in Ceph? RadosGW

• can store emails as objects
• extra network hop
• potential bottleneck
• very likely not fast enough

Security

• no direct access to Ceph storage network required
• connection to RadosGW can be secured (WAF)

[Diagram: application talks REST to RadosGW, which accesses the RADOS cluster via librados over a socket]

Page 19: Email Storage with Ceph - SUSECON2017

Where to store in Ceph? Librados

• direct access to RADOS
• parallel I/O
• not optimized for emails
• how to handle metadata/caches/indexes?

Security

• requires direct access to storage network
• only for dedicated platform

[Diagram: applications link librados directly and talk to the RADOS cluster]
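For illustration of what direct librados access looks like from an application, here is a minimal C++ sketch that connects to a cluster and stores one mail body as a RADOS object. It is not code from the rbox plugin; the pool name mail_storage follows the later rmb example, and the object id, client name and error handling are placeholder assumptions.

// Minimal librados sketch: store one RFC 5322 mail body as a RADOS object.
// Assumptions: a pool named "mail_storage" exists and /etc/ceph/ceph.conf
// plus a client keyring are available on the host.
#include <rados/librados.hpp>
#include <iostream>
#include <string>

int main() {
  librados::Rados cluster;
  if (cluster.init2("client.admin", "ceph", 0) < 0) return 1;   // connect as client.admin
  if (cluster.conf_read_file("/etc/ceph/ceph.conf") < 0) return 1;
  if (cluster.connect() < 0) return 1;

  librados::IoCtx io;
  if (cluster.ioctx_create("mail_storage", io) < 0) return 1;   // open the mail pool

  // The mail content is written in one shot; emails are WORM, so a
  // single write_full() per object is enough.
  librados::bufferlist mail;
  mail.append("From: alice@example.com\r\nSubject: hello\r\n\r\nHi Bob!\r\n");

  const std::string oid = "a2d69f2868b49a596a1d00009c60b9f7";    // example object id (GUID)
  int ret = io.write_full(oid, mail);
  std::cout << "write_full returned " << ret << std::endl;

  io.close();
  cluster.shutdown();
  return ret < 0 ? 1 : 0;
}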

Page 20: Email Storage with Ceph - SUSECON2017

Dovecot and Ceph

Page 21: Email Storage with Ceph - SUSECON2017

Dovecot

Open source project (LGPL 2.1, MIT)

72% market share (openemailsurvey.org, 02/2017)

Object store plugin available (obox)

• supports only REST APIs like S3/Swift
• not open source
• requires Dovecot Pro
• large impact on TCO

Page 22: Email Storage with Ceph - SUSECON2017

Dovecot Pro obox Plugin

[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot obox backend → fs API → fscache backend and object store backend. RFC 5322 mails and index & cache bundles are stored as objects in the object store; the metacache syncs the local index & cache with the object store and writes index & cache to local storage; a mail cache also lives on local storage.]

Page 23: Email Storage with Ceph - SUSECON2017

DT's approach

• no open source solution on the market
• closed source is no option
• develop/sponsor a solution
• open source it
• partner with:
  – Wido den Hollander (42on.com) for consulting
  – Tallence AG for development
  – SUSE for Ceph

Page 24: Email Storage with Ceph - SUSECON2017

Ceph plugin for Dovecot

First step: hybrid approach

Emails

• store in RADOS cluster

Metadata and indexes

• store in CephFS

Be as generic as possible

• split out code into libraries
• integrate into corresponding upstream projects

[Diagram: Mail User Agent connects via IMAP/POP to Dovecot; the rbox storage plugin uses librmb and librados to store mails in the RADOS cluster, and CephFS via the Linux kernel client for metadata and indexes.]

Page 25: Email Storage with Ceph - SUSECON2017

Librados mailbox (librmb)

Generic email abstraction on top of librados

Out of scope:

• user data and credential storage
  – target are huge installations where usually there are already solutions in place
• full text indexes
  – there are solutions already available and working outside email storage

Page 26: Email Storage with Ceph - SUSECON2017

Librados mailbox (librmb)

[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend. librmb and librados store the RFC 5322 mails as objects in the RADOS cluster; dovecot lib-index keeps index & metadata on CephFS via the Linux kernel client, with a cache on local storage.]

Page 27: Email Storage with Ceph - SUSECON2017

librmb - Mail Object Format

Mails are immutable regarding the RFC 5322 content

RFC 5322 content stored in RADOS directly

Immutable attributes used by Dovecot stored in RADOS xattr

• rbox format version
• GUID
• Received and saved date
• POP3 UIDL and POP3 order
• Mailbox GUID
• Physical and virtual size
• Mail UID

Writable attributes are stored in Dovecot index files
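As an illustration of this object format, here is a hedged librados C++ sketch that writes a mail object and attaches a few of the immutable attributes as RADOS xattrs. The single-letter keys mirror the labels shown in the rmb dump on the next slide (U, R, S, Z, V, M, G, I); the exact key names, encodings and values used by the real rbox plugin may differ.

// Sketch: store an immutable mail object plus Dovecot-relevant attributes
// as RADOS xattrs. Assumes an open librados::IoCtx "io" on the mail pool;
// key names follow the rmb dump output and are illustrative only.
#include <rados/librados.hpp>
#include <string>

static void set_attr(librados::IoCtx &io, const std::string &oid,
                     const char *key, const std::string &value) {
  librados::bufferlist bl;
  bl.append(value);
  io.setxattr(oid, key, bl);          // one xattr per immutable attribute
}

void store_mail(librados::IoCtx &io, const std::string &oid,
                const std::string &rfc5322_content) {
  // RFC 5322 content goes into the object data, written exactly once.
  librados::bufferlist data;
  data.append(rfc5322_content);
  io.write_full(oid, data);

  // Immutable attributes as xattrs (illustrative values).
  set_attr(io, oid, "I", "0.1");                                   // rbox format version
  set_attr(io, oid, "U", "4");                                     // mail UID
  set_attr(io, oid, "M", "ad54230e65b49a59381100009c60b9f7");      // mailbox GUID
  set_attr(io, oid, "G", oid);                                     // mail GUID
  set_attr(io, oid, "R", "1042503491");                            // receive time (epoch, illustrative)
  set_attr(io, oid, "S", "1503318152");                            // save time (epoch, illustrative)
  set_attr(io, oid, "Z", std::to_string(rfc5322_content.size()));  // physical size
  set_attr(io, oid, "V", std::to_string(rfc5322_content.size()));  // virtual size
}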

Page 28: Email Storage with Ceph - SUSECON2017

Dump email details from RADOS

$> rmb -p mail_storage -N t1 ls M=ad54230e65b49a59381100009c60b9f7

mailbox_count: 1

MAILBOX: M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
mail_total=2, mails_displayed=2
mailbox_size=5539 bytes

MAIL: U(uid)=4
oid = a2d69f2868b49a596a1d00009c60b9f7
R(receive_time)=Tue Jan 14 00:18:11 2003
S(save_time)=Mon Aug 21 12:22:32 2017
Z(phy_size)=2919 V(v_size)=2919 stat_size=2919
M(mailbox_guid)=ad54230e65b49a59381100009c60b9f7
G(mail_guid)=a3d69f2868b49a596a1d00009c60b9f7
I(rbox_version): 0.1
[..]
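Conceptually, the dump above can be reproduced with plain librados calls; the following hedged C++ sketch reads an object's size and xattrs back. It is not the rmb implementation, only an illustration of where the data lives.

// Sketch: read back a mail object's metadata the way the rmb dump implies:
// object size via stat(), immutable attributes via getxattrs().
// Assumes an open librados::IoCtx "io" on the mail pool.
#include <rados/librados.hpp>
#include <iostream>
#include <map>
#include <string>

void dump_mail(librados::IoCtx &io, const std::string &oid) {
  uint64_t size = 0;
  time_t mtime = 0;
  if (io.stat(oid, &size, &mtime) < 0) {
    std::cerr << "object " << oid << " not found" << std::endl;
    return;
  }
  std::cout << "oid = " << oid << ", stat_size = " << size << std::endl;

  std::map<std::string, librados::bufferlist> attrs;
  io.getxattrs(oid, attrs);                       // all immutable mail attributes
  for (auto &kv : attrs) {
    std::cout << kv.first << " = "
              << std::string(kv.second.c_str(), kv.second.length()) << std::endl;
  }
}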

Page 29: Email Storage with Ceph - SUSECON2017

RADOS Dictionary Plugin

makes use of Ceph omap key/value store

RADOS namespaces

shared/<key>

priv/<key>

used by Dovecot to store metadata, quota, ...
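The slide does not show the plugin's internal layout, but the underlying Ceph primitives are easy to sketch: the hedged C++ example below stores one private and one shared dictionary entry as omap key/values, using RADOS namespaces to separate per-user and shared data. The object name and keys are invented for illustration.

// Sketch: Dovecot-dict-style key/value entries in Ceph omap.
// "priv/" entries go into a per-user RADOS namespace, "shared/" entries
// into a shared one. Object name and keys are illustrative.
#include <rados/librados.hpp>
#include <map>
#include <string>

static librados::bufferlist make_bl(const std::string &s) {
  librados::bufferlist bl;
  bl.append(s);
  return bl;
}

void store_dict_entries(librados::IoCtx &io, const std::string &user) {
  const std::string dict_oid = "dovecot-dict";     // illustrative object name

  // Private (per-user) entry, e.g. quota usage.
  io.set_namespace(user);                          // per-user RADOS namespace
  std::map<std::string, librados::bufferlist> priv;
  priv["priv/quota/storage"] = make_bl("1048576");
  io.omap_set(dict_oid, priv);

  // Shared entry, visible to all users of the platform.
  io.set_namespace("shared");
  std::map<std::string, librados::bufferlist> shared;
  shared["shared/last_update"] = make_bl("2017-09-25");
  io.omap_set(dict_oid, shared);
}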

Page 30: Email Storage with Ceph - SUSECON2017

It's open source!

License: LGPLv2.1

Language: C++

Location: github.com/ceph-dovecot/

Supported Dovecot versions:

• 2.2 (>= 2.2.21)
• 2.3

Page 31: Email Storage with Ceph - SUSECON2017

Ceph Requirements

Performance

• write performance for emails is critical
• metadata/index read/write performance

Cost

• Erasure Coding (EC) for emails
• replication for CephFS

Reliability

• MUST survive failure of disk, server, rack and even fire compartments

Page 32: Email Storage with Ceph - SUSECON2017

Which Ceph Release?

Required features:

• BlueStore
  – should be at least 2x faster than filestore
• CephFS
  – stable release
  – Multi-MDS
• Erasure coding

Page 33: Email Storage with Ceph - SUSECON2017

SUSE Products to use

SLES 12-SP3 and SES 5

Page 35: Email Storage with Ceph - SUSECON2017

Hardware

Commodity x86_64 server

• HPE ProLiant DL380 Gen9
• dual socket
  – Intel® Xeon® E5 V4
• 2x Intel® X710-DA2 dual-port 10G
• 2x boot SSDs, SATA, HBA
• HBA, no separate RAID controller

CephFS, Rados, MDS and MON nodes

Page 36: Email Storage with Ceph - SUSECON2017

Storage Nodes

CephFS SSD Nodes

• CPU: [email protected], 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC
• SSD: 8x 1.6 TB SSD, 3 DWPD, SAS, RR/RW (4k Q16) 125k/93k IOPS

Rados HDD Nodes

• CPU: [email protected], 10 cores, turbo 3.4 GHz
• RAM: 128 GByte, DDR4, ECC
• SSD: 2x 400 GByte, 3 DWPD, SAS, RR/RW (4k Q16) 108k/49k IOPS
  – for BlueStore database etc.
• HDD: 10x 4 TByte, 7.2K, 128 MB cache, SAS, PMR

Page 37: Email Storage with Ceph - SUSECON2017

Compute Nodes

MDS

• CPU: [email protected], 6 cores, turbo 3.7 GHz
• RAM: 256 GByte, DDR4, ECC

MON / SUSE admin

• CPU: [email protected], 10 cores, turbo 3.4 GHz
• RAM: 64 GByte, DDR4, ECC

Page 38: Email Storage with Ceph - SUSECON2017

Why this specific HW?

Community recommendations?

• OSD: 1x 64-bit AMD-64, 1 GB RAM / 1 TB of storage, 2x 1 GBit NICs
• MDS: 1x 64-bit AMD-64 quad-core, 1 GB RAM minimum per MDS, 2x 1 GBit NICs

NUMA, high clocked CPUs and large RAM overkill?

• vendor did not offer single socket nodes for number of drives
• MDS performance is mostly CPU clock bound and partly single threaded

high clocked CPUs for fast single threaded performance

large RAM: better caching!

Page 40: Email Storage with Ceph - SUSECON2017

Issues

Datacenter

• usually two independent fire compartments (FCs)
• may have additional virtual FCs

Requirements

• loss of customer data MUST be prevented
• any server, switch or rack can fail
• one FC can fail
• data replication at least 3 times (or equivalent)

Page 41: Email Storage with Ceph - SUSECON2017

Issues

Questions

• How to place 3 copies in two FCs?
• How independent and reliable are the virtual FCs?
• Network architecture?
• Network bandwidth?

Page 42: Email Storage with Ceph - SUSECON2017

[Rack layout: Fire Compartment A: switches, MON, MDS, 6 HDD nodes, 3 SSD nodes. Fire Compartment B: switches, MON, MDS, 6 HDD nodes, 3 SSD nodes, SUSE Admin. Fire Compartment C (third room): switches, MON, MDS, 3 SSD nodes.]

Page 43: Email Storage with Ceph - SUSECON2017

Network

1G OAM network

10G network

• 2 NICs / 4 ports per node / SFP+ DAC

Multi-chassis Link Aggregation (MC-LAG / M-LAG)

• for aggregation and fail-over

Spine-Leaf architecture

• interconnect must not reflect theoretical rack/FC bandwidth
• L2: terminated in rack
• L3: TOR <-> spine / spine <-> spine
• Border Gateway Protocol (BGP)

Page 44: Email Storage with Ceph - SUSECON2017

[Diagram: spine-leaf network across FC1, FC2 and the vFC; spine switches connected by 2x 40G QSFP L3 crosslinks, 40G QSFP LAG (L3, BGP) towards DC-R, and N*10G SFP+ MC-LAG inside each rack (L2 terminated).]

Page 45: Email Storage with Ceph - SUSECON2017

Status and Next Steps

Page 46: Email Storage with Ceph - SUSECON2017

Status

Dovecot Ceph Plugin

• open sourced
  – https://github.com/ceph-dovecot/
• initial SLES 12-SP3, openSUSE Leap 42.3, and Tumbleweed RPMs
  – https://build.opensuse.org/package/show/home:dalgaaf:dovecot-ceph-plugin
• still under development
• still includes librmb
  – planned: move to Ceph project

Page 47: Email Storage with Ceph - SUSECON2017

Testing

Functional testing

• setup small 5-node cluster
  – SLES 12-SP3 GMC
  – SES 5 Beta
• run Dovecot functional tests against Ceph

Page 48: Email Storage with Ceph - SUSECON2017

Proof-of-Concept

Hardware

• 9 SSD nodes for CephFS
• 12 HDD nodes
• 3 MDS / 3 MON

2 FCs + 1 vFC

Testing

• run load tests
• run failure scenarios against Ceph
• improve and tune Ceph setup
• verify and optimize hardware

Page 49: Email Storage with Ceph - SUSECON2017

Further Development

Goal: pure RADOS backend, store metadata/index in Ceph omap

[Diagram: IMAP4/POP3/LMTP process → Storage API → dovecot rbox backend → librmb → librados; RFC 5322 mails are stored as objects and the index in omap, both in the RADOS cluster; only a cache remains on local storage.]
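The deck does not specify the omap layout for this future backend; as a rough idea of the direction, the hedged sketch below keeps per-mail index records as omap entries on a per-mailbox object, next to the mail objects themselves. All object and key names are invented for illustration and are not the actual rbox design.

// Sketch of the "pure RADOS backend" idea: index/metadata in omap on a
// per-mailbox object instead of CephFS. Names and layout are illustrative.
#include <rados/librados.hpp>
#include <map>
#include <string>

void index_mail_in_omap(librados::IoCtx &io,
                        const std::string &mailbox_guid,
                        const std::string &mail_guid,
                        uint32_t uid, uint64_t size) {
  // One index object per mailbox; its omap holds one record per mail.
  const std::string index_oid = "idx." + mailbox_guid;

  librados::bufferlist record;
  record.append("guid=" + mail_guid +
                ";uid=" + std::to_string(uid) +
                ";size=" + std::to_string(size));

  std::map<std::string, librados::bufferlist> entries;
  entries["mail." + std::to_string(uid)] = record;   // keyed by UID
  io.omap_set(index_oid, entries);
}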

Page 50: Email Storage with Ceph - SUSECON2017

Next Steps

Production

• verify if all requirements are fulfilled
• integrate in production
• migrate users step-by-step
• extend to final size
  – 128 HDD nodes, 1200 OSDs, 4.7 PiB
  – 15 SSD nodes, 120 OSDs, 175 TiB

Page 52: Email Storage with Ceph - SUSECON2017

Summary and conclusions

Ceph can replace NFS

• mails in RADOS
• metadata/indexes in CephFS
• BlueStore, EC

librmb and dovecot rbox

• Open Source, LGPLv2.1, no license costs
• librmb can be used in non-dovecot systems
• still under development

PoC with dovecot in progress

Performance optimization

Page 53: Email Storage with Ceph - SUSECON2017

Be invited to: Participate!

Try it, test it, give feedback and report bugs! Contribute!

github.com/ceph-dovecot/

Thank you.

Page 55: Email Storage with Ceph - SUSECON2017


Page 56: Email Storage with Ceph - SUSECON2017

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.


Page 57: Email Storage with Ceph - SUSECON2017