SPUD: A Distributed High Performance Publish-Subscribe Cluster
Uriel Peled and Tal Kol
Guided by Edward Bortnikov
Software Systems Laboratory, Faculty of Electrical Engineering, Technion
Project Goal
Design and implement a general-purpose Publish-Subscribe server
Push traditional implementations to global-scale performance demands:
1 million concurrent clients
Millions of concurrent topics
High transaction rate
Demonstrate server abilities with a fun client application
What is Pub/Sub?
Clients subscribe to a topic, e.g. topic://traffic-jams/ayalon
A publisher publishes a message to the topic: "accident in hashalom"
Every subscriber to the topic receives "accident in hashalom"
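The subscribe/publish flow above can be sketched as a minimal in-memory broker. This is an illustration only (the names `Broker`, `subscribe`, and `publish` are hypothetical, not SPUD's actual API), and it ignores everything that makes SPUD interesting: networking, distribution, and scale.

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub/sub broker (illustration only)."""

    def __init__(self):
        # topic URI -> set of subscriber callbacks
        self._subs = defaultdict(set)

    def subscribe(self, topic, callback):
        self._subs[topic].add(callback)

    def publish(self, topic, message):
        # Deliver the message to every current subscriber of the topic
        for callback in self._subs[topic]:
            callback(message)

# Usage: two commuters subscribe to the same topic, one report reaches both
broker = Broker()
inbox_a, inbox_b = [], []
broker.subscribe("topic://traffic-jams/ayalon", inbox_a.append)
broker.subscribe("topic://traffic-jams/ayalon", inbox_b.append)
broker.publish("topic://traffic-jams/ayalon", "accident in hashalom")
```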
What Can We Do With It? Collaborative Web Browsing
What Can We Do With It? Instant Messaging
One user sends "Hi buddy!" by publishing it; the other receives it as a subscriber
Seems Easy To Implement, But…
“I’m behind a NAT, I can’t connect!”
Not all client setups are server-friendly
“Server is too busy, try again later?!”
1 million concurrent clients is simply too much
“The server is so slow!!!”
Service time grows exponentially with load
“A server crashed, everything is lost!”
Single points of failure will eventually fail
Naïve Implementation (example 1)
Simple UDP for client-server communication
No need for sessions since we just send messages
Very low cost-per-client
Sounds perfect? No: a NAT drops the server’s unsolicited UDP packets
NAT Traversal
UDP hole punching
The NAT will accept a UDP reply only for a short window
Our measurements: 15-30 seconds
Keep each client pinging over UDP every 15s
Days-long TCP sessions
The NAT remembers current sessions for replies
If the WWW works, we should work
Dramatically increases cost-per-client
Our research: all IMs do exactly this
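The UDP hole-punching option can be sketched with plain sockets. This demo uses the loopback interface as a stand-in for a real server behind the Internet, so there is no actual NAT in the picture; the comments mark where the NAT mapping (and the 15s re-ping from the slide) would come into play.

```python
import socket

# Stand-in "server": in production this would be the pub/sub server's
# public endpoint on the far side of the client's NAT.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))
server_addr = server.getsockname()

# Client sends a ping. A real NAT would now create a mapping that stays
# open only 15-30 seconds, so the client must re-send this ping every
# ~15s to keep the return path alive.
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(2.0)
client.sendto(b"ping", server_addr)

# The server's reply rides back through the (hypothetical) NAT mapping
data, client_addr = server.recvfrom(64)
server.sendto(b"pong", client_addr)

reply, _ = client.recvfrom(64)
client.close()
server.close()
```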
Naïve Implementation (example 2)
Blocking I/O with one thread per client
Basic model for most servers (Java default)
Traditional UNIX: fork for every client
Sounds perfect?
Network I/O Internals
Blocking I/O – one thread per client
2MB stack per thread in 1GB of virtual address space is enough for only 512 threads (!)
Non-blocking I/O – select
Linear fd searches are very slow
Asynchronous I/O – completion ports
Thread pool to handle request completion
Our measurements: 30,000 concurrent clients!
What is the bottleneck?
Number of locked pages (zero-byte receives)
TCP/IP kernel driver non-paged pool allocations
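The middle option on the slide, multiplexing many clients without a thread each, can be sketched with Python's selectors module (epoll on Linux, avoiding select's linear fd scans). This is only an illustration of the readiness model; SPUD itself used Windows asynchronous I/O with completion ports, which is a completion model rather than a readiness model.

```python
import selectors
import socket
import threading

def echo_server(listener, stop):
    """One thread multiplexes all connections via readiness events,
    instead of dedicating a 2MB-stack thread to each client."""
    sel = selectors.DefaultSelector()
    listener.setblocking(False)
    sel.register(listener, selectors.EVENT_READ)
    while not stop.is_set():
        for key, _ in sel.select(timeout=0.1):
            sock = key.fileobj
            if sock is listener:
                conn, _ = listener.accept()       # new client
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ)
            else:
                data = sock.recv(4096)            # ready, won't block
                if data:
                    sock.sendall(data)            # echo back
                else:
                    sel.unregister(sock)          # client disconnected
                    sock.close()
    sel.close()

# Usage: one client round-trips a message through the server
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen()
stop = threading.Event()
t = threading.Thread(target=echo_server, args=(listener, stop))
t.start()

client = socket.create_connection(listener.getsockname())
client.sendall(b"hello")
echoed = client.recv(4096)
client.close()
stop.set()
t.join()
listener.close()
```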
Scalability
Scale up: buy a bigger box
Scale out: buy more boxes
Which one to do? Both!
Push each box to its hardware maximum, since 1000’s of servers is impractical
Add relevant boxes as load increases: the Google way (cheap PC server farms)
Identify Our Load Factors
Concurrent TCP clients
Scale up: async I/O, zero-byte receives, larger non-paged pool (NPP)
Scale out: dedicate boxes to handle clients => Connection Server (CS)
High transaction throughput (topic load)
Scale up: software optimizations
Scale out: dedicate boxes to handle topics => Topic Server (TS)
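Scaling out the Topic Servers requires every Connection Server to agree on which TS owns a given topic. One simple scheme, sketched here with hypothetical names (the source does not describe SPUD's actual partitioning logic), is to hash each topic URI onto a fixed list of Topic Servers:

```python
import hashlib

TOPIC_SERVERS = ["ts-0", "ts-1", "ts-2", "ts-3"]  # hypothetical TS cluster

def topic_server_for(topic):
    """Deterministically map a topic to one Topic Server, so every
    Connection Server routes a topic's traffic to the same box."""
    digest = hashlib.sha1(topic.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(TOPIC_SERVERS)
    return TOPIC_SERVERS[index]

# Every caller agrees on the owner of a topic
owner = topic_server_for("topic://traffic-jams/ayalon")
```

A plain modulo mapping reshuffles almost every topic when a server is added; a production design would more likely use consistent hashing to limit that churn.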