When small problems become big problems @adrianfcole
Jul 07, 2015
Agenda
• Introduction to CloudHub
• Challenges we faced building multi-tenant architecture
• Q/A
Ego slide
Adrian Cole (@jclouds)
• founded jclouds, March 2009
• architect at cloudhub.io
Platform as a Service
• Automated Provisioning
• Event Tracking
• Centralized Logging
• Secure Data Gateway
The landlord’s dilemma
When you’ve priced yourself out of business
Cloud is utility, but your service may be more
• Measurement-based pricing exists in the infrastructure tier
• Know your customer: who they are and where in the value chain you act
• Don’t get into a race to the bottom
When 200 users become 2000 accounts
Choosing a BASIC starting point
• Already had a LDAP infrastructure
• Straightforward integration with console and other access tools
• Easy to do BASIC authentication
Remember users (and API users)
• Basic Auth is not a good choice for an API over time
• System integrators need delegated access
• Hard to clean up accounts when there are multiple owners
When myapp.cloudhub.io becomes myapp001.cloudhub.io
How to present the iApps
• X.cloudhub.io
• DNS is flexible to deal with
• clear branding
X.cloudhub.io woes
• Namespace contention
• qa.cloudhub.io isn’t really an iApp
• need to maintain blacklist
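A minimal sketch of how the blacklist and the myapp001 fallback could fit together. The reserved names, the three-digit suffix scheme, and the function names are assumptions for illustration, not CloudHub's actual rules.

```python
import re

# Hypothetical reserved labels; a real list would be maintained by the operator.
RESERVED = {"www", "qa", "api", "admin", "mail"}

# One DNS label: letters, digits, hyphens; no leading or trailing hyphen; 63 chars max.
LABEL_RE = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def is_allowed(label, taken):
    """Return True if label is a valid, unreserved, unclaimed subdomain."""
    label = label.lower()
    return (
        LABEL_RE.fullmatch(label) is not None
        and label not in RESERVED
        and label not in taken
    )


def suggest(label, taken, max_suffix=999):
    """Return label if it is free, else the first free numbered variant, else None."""
    label = label.lower()
    if is_allowed(label, taken):
        return label
    for n in range(1, max_suffix + 1):
        candidate = f"{label}{n:03d}"
        if is_allowed(candidate, taken):
            return candidate
    return None
```

Here suggest("myapp", {"myapp"}) falls back to a numbered name, which is exactly the branding compromise the slide title describes.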
When mule isn’t mule
PaaS is more than java -jar mule.jar
• CloudHub adds services integration to Mule
• Logging, Event Tracking, Replay, etc.
appstack -> platform is tricky
• transparent features and also compatible?
• dealing with network streams that could be more brittle
• matching serialization/marshalling w/ cloud features like streaming
When SLA turns into refund
Desire to rely on more services
• Cloud Infrastructure
• Cloud Search
• Cloud Scaling
Reality of relying on more services
• every service dependency you add lowers your own uptime
• services may underperform their SLAs with little financial impact
• you may need to manually deal with service outages
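A back-of-the-envelope sketch of the uptime point above. It assumes every dependency is required and that failures are independent, which is a simplification; the function names are made up for the example.

```python
import math


def composite_availability(availabilities):
    """Availability of a service that needs all of its dependencies up at once."""
    return math.prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected downtime in hours over a 365-day year."""
    return (1.0 - availability) * 365 * 24
```

Under these assumptions, three dependencies at 99.9% each leave you at about 99.7%: roughly 26 hours of downtime a year instead of under 9 for a single one.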
When logging turns into a big data problem
Customers desire real time search
• need to centralize and index logs
• using ElasticSearch can avoid service fees or license fees
• with a custom logging plugin, we can redirect output to the cluster
Logging is always a big problem
• Clusters can fail for reasons beyond servers deployed
• API design for logging is different
• What happens if your disk fails or your cluster fails?
• What happens when you replace a worker?
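One possible answer to "what if the cluster fails", sketched with Python's standard logging module. The send callable is hypothetical (anything that ships one line to the log cluster and raises OSError when it can't), and the backlog size is an arbitrary choice.

```python
import collections
import logging


class BufferedForwarder(logging.Handler):
    """Forward log lines to a remote sink; keep a bounded backlog while it is down."""

    def __init__(self, send, max_buffered=1000):
        super().__init__()
        self.send = send  # hypothetical: ships one line, raises OSError on failure
        self.backlog = collections.deque(maxlen=max_buffered)
        self.dropped = 0  # count of lines evicted because the backlog was full

    def emit(self, record):
        if len(self.backlog) == self.backlog.maxlen:
            self.dropped += 1  # the deque is about to evict its oldest line
        self.backlog.append(self.format(record))
        self.flush_backlog()

    def flush_backlog(self):
        """Send queued lines in order; stop at the first failure."""
        while self.backlog:
            try:
                self.send(self.backlog[0])
            except OSError:
                return  # sink still down: keep the backlog, don't block the app
            self.backlog.popleft()
```

The backlog lives in memory, so it is lost when the worker is replaced. That is the last question in the list above, and this sketch does not answer it.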
Real men test in production
Testability is crucial
• each dependency needs to be testable and mockable
• devs need a local environment that matches, or your test cases will suffer
• creation of new tenants means more money.. test it!
Platform testing is really hard
• Some external deps don’t have sandboxes
• Can you try 500 applications?
• Can you maintain a quiet production “neighborhood” while testing QA?
When security updates = vi ipsec.conf in a for loop
Security in a public service is hard
• assume user is infinitely clever and malicious
• deny by default vs service simplicity
• maintain segregation and availability of tenants
• Asset value can vary widely across tenants
Security design touches everything
• ipsec is hard to maintain without proper CM, and wasn’t built for noisy networks
• deny by default means higher maintenance, and not all products support it
• it is easy to violate tenancy segregation in a platform
• you may have to hire consultants
When your management service goes haywire
automation automation automation
• myriad of technology to automate scaling and availability
• policies can be fine tuned to relaunch or scale out based on system feedback or api
What about network splits
• Will your management server “heal” something that is already around?
• Is your management server on the same failure plane as your managed servers?
• Will you end up with manual intervention controls (aka red button)?
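One way to frame the questions above in code, as a sketch only. The idea of asking independent observers before healing, the action budget, and the three outcomes are assumptions for illustration, not something the talk prescribes.

```python
def decide_heal(observer_reports, actions_in_window, max_actions=3):
    """Decide what to do about a worker the manager cannot reach.

    observer_reports: booleans from independent vantage points,
        True meaning that observer also sees the worker as down.
    actions_in_window: automated relaunches already done recently.
    Returns "relaunch", "wait", or "escalate".
    """
    # Too many automated actions lately: stop and hand over to a human.
    if actions_in_window >= max_actions:
        return "escalate"
    down_votes = sum(1 for report in observer_reports if report)
    # Only heal when a strict majority agrees the worker is really gone;
    # otherwise the manager itself may be on the wrong side of a split.
    if down_votes * 2 > len(observer_reports):
        return "relaunch"
    return "wait"
```

The "escalate" branch is the red button from the last bullet, reached on purpose rather than by accident.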
When your api design haunts you
Put an API on everything
• Allows automation and GUIs beyond what you’ve invented
• simplifies testing
• eat your own dogfood
Design redo is a big problem
• GUIs can change more easily, since humans drive them
• Maintaining old APIs may not be worth it
• People may depend on bugs or semantic gaps
• Versioning practices in REST are not uniform
• Remember: understanding the state machine is a prerequisite for HATEOAS
When 5 retries becomes a DDoS attack
We want to build resilient apps
• recovery is a part of the service you provide, more important as you go up in value chain
• connections should assume failure and be able to reconnect to dependencies
• recovery is non-trivial
5 retries is a code smell
• things that back up or fail can get worse with naive error retry loops
• APIs often can be made to include data about when to retry or that you need to slow down
• Treat resilience as a requirement, not a feature
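A minimal sketch of a retry loop that is less naive: exponential backoff with full jitter, and a server-provided hint taking priority when there is one. RetryableError and its retry_after field are hypothetical; with HTTP the hint would typically come from a Retry-After header.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type; retry_after is a server hint in seconds, if any."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random fraction of min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call operation(); on RetryableError wait, then try again, up to max_attempts."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetryableError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            if err.retry_after is not None:
                delay = err.retry_after  # the server told us when to come back
            else:
                delay = backoff_delay(attempt, rng=rng)
            sleep(delay)
```

The jitter matters as much as the backoff: without it, every client that failed together retries together, which is how 5 retries turn into the attack in the slide title.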
When your users ask the same questions
Wrong words suck
• Some terms seem sensible in design discussions, but the public uses something else
• Changing requires retraining, and thorough doc review
• What goes online lingers
When a feature request implies new architecture
• Customers are looking for service, not explanations of why it is hard
• Adding value implies tough decisions on new features
• As the world turns, expectations rise
• Know your customer
Platform changes
• Not all databases support full-text search, esp with partitioning
• Some data is better stored in S3, how does that affect indexing strategy?
• Real-time tools are emerging but immature
Real-time, full-text search, streaming.. oh my!
When you end up with a “lock” table in mongo
Datastore diversity!
• NoSQL datastores like Mongo are attractive and energize developers
• Cloud provisioners like RDS-driven MySQL are also attractive
• Specialized stores like CloudWatch for statistics
Don’t expect mongo to do magic
• Database engines mature
• Consistent backups are tricky and only recently supported
• Data Ops and visualization tools are emerging
• There are type safe bridges like Morphia
Hammers and screwdrivers
• In a pinch, you can knock in a screw with a hammer, but you can’t screw in a nail with a screwdriver
• Don’t throw data into whatever store happens to be easy to grab, even if you can.
• Rechecking data assumptions at T1 is better than T3. At T6, you may have a disaster
Summary
When developing a multi-tenant platform
• Own your dependencies or they will own you
• Add time for entropy
• Repeatedly remind yourself you are a landlord
Architecture as iterative development
• Forethought
• Critical debate
• Decision review