When small problems become big problems

@adrianfcole

Agenda

• Introduction to CloudHub

• Challenges we faced building multi-tenant architecture

• Q/A

Adrian Cole (@jclouds)

• founded jclouds March 2009

• architect at cloudhub.io

Ego slide

Platform as a Service

• Automated Provisioning

• Event Tracking

• Centralized Logging

• Secure Data Gateway

The landlord’s dilemma

When you’ve priced yourself out of business

Cloud is utility, but your service may be more

• Measurement based pricing exists in infrastructure tier

• Know your customer: who they are and where in the value chain you act

• Don’t get into a race to the bottom

When 200 users becomes 2000 accounts

Choosing a BASIC starting point

• Already had an LDAP infrastructure

• Straightforward integration with console and other access tools

• Easy to do BASIC authentication

Remember users (and api users)

• Basic Auth is not a good choice for an API over time

• System integrators need delegated access

• Hard to clean up accounts when there are multiple owners
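One reason Basic Auth ages poorly for an API can be seen in a small sketch. The Basic scheme sends `username:password` base64-encoded on every request, so anyone holding the header holds the full credential, and there is nothing in it to scope or delegate. The function names below are illustrative assumptions, not part of any CloudHub API.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP Authorization header for the Basic scheme."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def decode_basic_auth(header):
    """Recover the username and password from a Basic Authorization header.

    Base64 is an encoding, not encryption: this needs no secret.
    """
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    decoded = base64.b64decode(token).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password

header = basic_auth_header("integrator", "hunter2")
user, secret = decode_basic_auth(header)  # the full password comes back out
```

A system integrator acting for several accounts would need each owner's real password under this scheme, which is the delegated-access gap the bullets describe.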

When myapp.cloudhub.io becomes myapp001.cloudhub.io

How to present the iApps

• X.cloudhub.io

• DNS is flexible to work with

• clear branding

X.cloudhub.io woes

• Namespace contention

• qa.cloudhub.io isn’t really an iApp

• need to maintain blacklist
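A minimal sketch of the check these woes imply, assuming a hand-maintained set of reserved names and a set of names already taken. The reserved list and the function name are hypothetical; only `qa` comes from the slide.

```python
import re

# Hypothetical reserved names; a real list would be maintained by the platform team.
RESERVED = {"www", "qa", "api", "admin", "mail", "status"}

# One DNS label: 1-63 characters, letters, digits and hyphens,
# with no leading or trailing hyphen.
LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_claimable(name, taken):
    """Return True if `name` could be registered as <name>.cloudhub.io."""
    label = name.strip().lower()
    if not LABEL.match(label):
        return False
    return label not in RESERVED and label not in taken

taken = {"myapp"}
is_claimable("qa", taken)        # reserved by the platform
is_claimable("myapp", taken)     # contention: someone got there first
is_claimable("myapp001", taken)  # the workaround users end up with
```

The reserved set only grows over time, which is the maintenance cost the last bullet points at.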

When mule isn’t mule

PaaS is more than java -jar mule.jar

• CloudHub adds services integration to Mule

• Logging, Event Tracking, Replay, etc.

appstack -> platform is tricky

• transparent features and also compatible?

• dealing with network streams that could be more brittle

• matching serialization/marshalling w/ cloud features like streaming

When SLA turns into refund

Desire to rely on more services

• Cloud Infrastructure

• Cloud Search

• Cloud Scaling

Reality of relying on more services

• uptime drops with every service dependency you add

• services may underperform their SLAs with little financial impact

• you may need to manually deal with service outages
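The uptime point can be made concrete with a little arithmetic. This is a sketch under one stated assumption: the platform needs every dependency up at once and failures are independent, so availabilities multiply. The 99.9% figures are illustrative, not quoted from any provider's SLA.

```python
from math import prod

def composite_availability(availabilities):
    """Availability of a service that needs all dependencies up at once.

    Assumes failures are independent, so the individual values multiply.
    """
    return prod(availabilities)

def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24

one_dependency = composite_availability([0.999])
three_dependencies = composite_availability([0.999, 0.999, 0.999])

print(f"{downtime_hours_per_year(one_dependency):.1f} h/year with one dependency")
print(f"{downtime_hours_per_year(three_dependencies):.1f} h/year with three")
```

Under these assumptions, three dependencies at 99.9% each compose to roughly 99.7%, which is about 26 hours of downtime a year instead of under 9.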

When logging turns into a big data problem

Customers desire real time search

• need to centralize and index logs

• using ElasticSearch can avoid service fees or license fees

• with a custom logging plugin, we can redirect output to the cluster
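As an illustration of the plugin idea, here is a sketch using Python's standard `logging` module rather than Mule's own logging stack. The handler buffers records and hands them in batches to `ship_batch`, a hypothetical callable standing in for whatever client writes to the search cluster. The field names are assumptions as well.

```python
import logging

class ClusterLogHandler(logging.Handler):
    """Buffers log records as dicts and ships them to a cluster in batches."""

    def __init__(self, ship_batch, tenant, batch_size=100):
        super().__init__()
        self.ship_batch = ship_batch  # hypothetical: sends a list of dicts to the cluster
        self.tenant = tenant
        self.batch_size = batch_size
        self._buffer = []

    def emit(self, record):
        # Tag every entry with its tenant so the index can be searched per customer.
        self._buffer.append({
            "tenant": self.tenant,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "timestamp": record.created,
        })
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.acquire()  # the handler lock is re-entrant, so this is safe from emit()
        try:
            if not self._buffer:
                return
            batch, self._buffer = self._buffer, []
            try:
                self.ship_batch(batch)
            except Exception:
                # Cluster unreachable: keep the batch so a later flush can retry.
                self._buffer = batch
        finally:
            self.release()

shipped = []
log = logging.getLogger("myapp")
log.setLevel(logging.INFO)
log.addHandler(ClusterLogHandler(shipped.append, tenant="myapp", batch_size=2))
log.info("worker started")
log.info("processed %d messages", 42)  # second record fills the batch and ships it
```

Note that the retry buffer here is unbounded and lives in memory, so a long cluster outage or a replaced worker still loses or hoards logs.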

Logging is always a big problem

• Clusters can fail for reasons beyond servers deployed

• API design for logging is different

• What happens if your disk fails or your cluster fails?

• What happens when you replace a worker?

Real men test in production

Testability is crucial

• each dependency needs to be testable and mockable

• devs need a local environment that matches, or your test cases will suffer

• creation of new tenants means more money.. test it!
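One way to keep a dependency testable and mockable is to pass it in rather than reach for it. The sketch below tests tenant creation against an in-memory stand-in for a DNS provider. `InMemoryDns`, `provision_tenant`, and the `lb.internal.example` target are all hypothetical names chosen for illustration.

```python
class InMemoryDns:
    """Fake DNS provider for tests; records what would have been created."""

    def __init__(self):
        self.records = {}

    def create_record(self, name, target):
        if name in self.records:
            raise ValueError(f"{name} already exists")
        self.records[name] = target

def provision_tenant(app_name, dns, domain="cloudhub.io", target="lb.internal.example"):
    """Create the DNS entry for a new application and return its hostname.

    `dns` is any object with a create_record(name, target) method, so the
    real provider client and the fake above are interchangeable.
    """
    fqdn = f"{app_name}.{domain}"
    dns.create_record(fqdn, target)
    return fqdn

dns = InMemoryDns()
hostname = provision_tenant("myapp", dns)
assert dns.records[hostname] == "lb.internal.example"
```

Because the fake enforces the same uniqueness rule as the real service would, the revenue-critical path of creating a tenant can be exercised locally without a sandbox.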

Platform testing is really hard

• Some external deps don’t have sandboxes

• Can you try 500 applications?

• Can you maintain a quiet production “neighborhood” while testing QA?

When security updates = vi ipsec.conf in a for loop

Security in a public service is hard

• assume user is infinitely clever and malicious

• deny by default vs service simplicity

• maintain segregation and availability of tenants

• Asset value can vary widely across tenants

Security design touches everything

• ipsec is hard to maintain without proper CM, and wasn’t built for a noisy network

• deny by default means higher maintenance, and not all products support it

• it is easy to violate tenancy segregation in a platform

• you may have to hire consultants

When your management service goes haywire

automation automation automation

• myriad of technology to automate scaling and availability

• policies can be fine tuned to relaunch or scale out based on system feedback or api

What about network splits

• Will your management server “heal” something that is already around?

• Is your management server on the same failure plane as your managed servers?

• Will you end up with manual intervention controls (aka red button)?
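One hedge against a management server "healing" a worker that is only unreachable from its own side of a split is to require agreement from several vantage points before acting. This is a sketch of that idea, not a description of how CloudHub does it. The probe names and the majority rule are assumptions.

```python
def should_relaunch(probe_results, quorum=None):
    """Decide whether a worker looks dead enough to replace.

    probe_results maps a vantage point name to True if the worker answered
    from there. By default a strict majority must see it as unreachable.
    """
    if not probe_results:
        return False  # no evidence at all: do nothing rather than guess
    needed = quorum if quorum is not None else len(probe_results) // 2 + 1
    unreachable = sum(1 for ok in probe_results.values() if not ok)
    return unreachable >= needed

# Only the management server lost sight of the worker: likely a network split.
should_relaunch({"mgmt": False, "zone-b": True, "zone-c": True})    # False

# Nobody can reach it: more likely a real failure.
should_relaunch({"mgmt": False, "zone-b": False, "zone-c": False})  # True
```

This only helps if the vantage points sit on different failure planes from the management server, and a manual override is still worth having.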

When your api design haunts you

Put an API on everything

• Allows automation and guis besides what you’ve invented

• simplifies testing

• eat your own dogfood

Design redo is a big problem

• GUIs can change more easily, since humans drive them

• Maintaining old apis may not be worth it

• People may depend on bugs or semantic gaps

• Version practices in ReST are not uniform

• remember: understanding the state machine is a prerequisite for HATEOAS
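The HATEOAS point can be sketched as follows: the links a resource advertises are just the transitions its state machine allows from the current state. The states, actions, and URL layout below are invented for illustration and are not CloudHub's actual API.

```python
# Hypothetical application lifecycle: state -> {action: next state}.
TRANSITIONS = {
    "stopped": {"start": "started", "delete": "deleted"},
    "started": {"stop": "stopped"},
    "deleted": {},
}

def links_for(app_id, state):
    """Hypermedia links a client may follow from the given state."""
    return [
        {"rel": action, "href": f"/applications/{app_id}/{action}"}
        for action in TRANSITIONS[state]
    ]

def apply_action(state, action):
    """Return the next state, or raise if the action is not allowed here."""
    try:
        return TRANSITIONS[state][action]
    except KeyError:
        raise ValueError(f"cannot {action} from state {state!r}") from None

representation = {
    "id": "myapp",
    "state": "started",
    "links": links_for("myapp", "started"),  # only "stop" is offered
}
```

If the state machine is not written down, neither the server nor its clients can say which links belong in a response.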

When 5 retries becomes a DDoS attack

We want to build resilient apps

• recovery is a part of the service you provide, more important as you go up in value chain

• connections should assume failure and be able to reconnect to dependencies

• recovery is non-trivial

5 retries is code smell

• things that back up or fail can get worse with naive error retry loops

• APIs often can be made to include data about when to retry or that you need to slow down

• Treat resilience as a requirement, not a feature
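A sketch of a less naive retry loop, assuming the client surfaces the server's hint (for example an HTTP `Retry-After` header on a 429 or 503 response) as a number of seconds. `RetryableError` and both function names are hypothetical. Without a hint, the loop falls back to capped exponential backoff with random jitter so that many clients do not retry in lockstep.

```python
import random
import time

class RetryableError(Exception):
    """Hypothetical error a client raises when a call may be retried."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds suggested by the server, if any

def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server told us when to come back
    # Full jitter: a random point between 0 and the capped exponential ceiling.
    return rng() * min(cap, base * (2 ** attempt))

def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Run `operation`, retrying on RetryableError with backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, retry_after=err.retry_after))
```

The attempt count is still bounded at five here. What changes is that the waits grow and spread out, so a struggling dependency sees less load from its clients rather than more.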

When your users ask the same questions

Wrong words suck

• Some terms seem sensible in design discussions, but the public uses something else

• Changing requires retraining, and thorough doc review

• What goes online lingers

When a feature request implies new architecture

• Customers are looking for service, not explanations of why it is hard

• Adding value implies tough decisions on new features

• As the world turns, expectations rise

• Know your customer

Platform changes

• Not all databases support full-text search, esp with partitioning

• Some data is better stored in S3, how does that affect indexing strategy?

• Real-time tools are emerging but immature

Real-time, full-text search, streaming.. oh my!

When you end up with a “lock” table in mongo

Datastore diversity!

• NoSQL datastores like Mongo are attractive and energize developers

• Cloud provisioners like RDS-driven MySQL are also attractive

• Specialized stores like CloudWatch for statistics

Don’t expect mongo to do magic

• Database Engines Mature

• Consistent backups are tricky and only recently supported

• Data Ops and visualization tools are emerging

• There are type safe bridges like Morphia

Hammers and screwdrivers

• In a pinch, you can knock in a screw with a hammer, but you can’t screw in a nail with a screwdriver

• Don’t throw data into whatever store happens to be easy to grab, even if you can.

• Rechecking data assumptions at T1 is better than T3. At T6, you may have a disaster

Summary

When developing a multi-tenant platform

• Own your dependencies or they will own you

• Add time for entropy

• Repeatedly remind yourself you are a landlord

Architecture as iterative development

• Forethought

• Critical debate

• Decision review

‣ @adrianfcole

‣ www.cloudhub.io
