The new Netflix API
Why more complexity must lead to more simplicity
Katharina Probst, DevNexus 2017
Today’s architecture
[Diagram: devices (mostly JS) → network boundary → Gateway → API server JVM (Groovy scripts; Java client libs A … Z) → network boundary → Netflix micro-services]
What is the Netflix API’s raison d’être?
Is the API just one gigantic translation layer?
Is it a routing layer?
If it’s too complex, can we just get rid of it?
Raison d’Être.
1. Orchestration
2. Availability protection
3. Abstraction
Raison d’Être
1. Orchestration
Simple example: search
Related Terms
People
Titles
Search request → response
● Search service provides related search terms
● Search service provides IDs for videos and people
○ IDs depend on various factors, e.g., different catalogs in different countries
● For each ID, we need metadata
○ Titles
○ Images
○ Names
○ Ratings
○ etc.
● ..., which depend on
○ Country
○ A/B tests the user is in
○ etc.
Response:
❏ Hydrated videos
❏ People names
❏ Query suggestions
Orchestration
● Own order of operations
● Provide whatever info clients/services need
○ From other clients/libraries/services
○ From request
● Merge partial results
● Filter results
● Retrieve more info if necessary
● Support mutations (e.g., profile switch)
● Support complex transactions in a limited way
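The search example can be sketched in a few lines of Java. This is only an illustration of the orchestration steps (own the order of operations, hydrate IDs, filter, merge partial results), not the actual Netflix API code. The class name, the method name and the three function parameters are hypothetical stand-ins for the real search and metadata client libraries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.Function;

class SearchOrchestration {
    /** Merged response handed back to the device (hypothetical shape). */
    public static final class SearchResponse {
        public final List<String> titles;
        public final List<String> suggestions;

        SearchResponse(List<String> titles, List<String> suggestions) {
            this.titles = titles;
            this.suggestions = suggestions;
        }
    }

    /**
     * findVideoIds: (query, country) -> video IDs; stands in for the search service.
     * relatedTerms: query -> related search terms; also from the search service.
     * lookupTitle:  (videoId, country) -> title, empty if not available in that country.
     */
    public static SearchResponse search(
            String query,
            String country,
            BiFunction<String, String, List<String>> findVideoIds,
            Function<String, List<String>> relatedTerms,
            BiFunction<String, String, Optional<String>> lookupTitle) {
        // 1. The API owns the order of operations: IDs first, metadata second.
        List<String> ids = findVideoIds.apply(query, country);

        // 2. Hydrate each ID and filter out IDs that have no metadata for this country.
        List<String> titles = new ArrayList<>();
        for (String id : ids) {
            lookupTitle.apply(id, country).ifPresent(titles::add);
        }

        // 3. Merge the partial results into one response.
        return new SearchResponse(titles, relatedTerms.apply(query));
    }

    public static void main(String[] args) {
        SearchResponse response = search(
                "stranger", "US",
                (q, c) -> Arrays.asList("v1", "v2", "v3"),
                q -> Arrays.asList("stranger things"),
                (id, c) -> id.equals("v2") ? Optional.<String>empty() : Optional.of("Title " + id));
        System.out.println(response.titles + " " + response.suggestions);
    }
}
```

The device makes one request and gets one merged response; the fan-out to several services stays on the server side of the network boundary.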
2. Availability protection
Prevent this as much as possible
What do customers want?
● No personalized recommendations, or no ability to stream?
● No search, or no ability to continue watching the movie you started last night?
● No cutting-edge A/B experiment experience, or no ability to stream?
Top priority: customer experience
● Top priority of top priority: customer can stream videos
● This means API cannot go down entirely
○ If it does, we have an outage
● But some services are not critical to this mission
○ A/B - if we don’t know what A/B tests you’re in, you can still get the default experience
○ Search - if you can’t search, you can still browse
Exposure to failures
● As your app grows, your set of dependencies is much more likely to get bigger, not smaller
● Overall uptime = (Dep uptime)^(num deps)
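A quick worked example of the uptime formula, as a sketch. It assumes every request needs all dependencies and that dependencies fail independently. The figures (30 dependencies at 99.99% uptime each) are illustrative assumptions, not numbers from the talk.

```java
class UptimeEstimate {
    /** Overall uptime = (dependency uptime)^(number of dependencies). */
    public static double overallUptime(double dependencyUptime, int numDependencies) {
        return Math.pow(dependencyUptime, numDependencies);
    }

    public static void main(String[] args) {
        // Illustrative assumption: 30 dependencies, each up 99.99% of the time.
        double overall = overallUptime(0.9999, 30);
        double hoursDownPerMonth = (1 - overall) * 30 * 24;
        System.out.printf("overall uptime: %.2f%%, about %.1f hours down per 30-day month%n",
                overall * 100, hoursDownPerMonth);
    }
}
```

Under these assumptions, overall uptime falls to roughly 99.7%, which is a bit over two hours of downtime in a 30-day month even though each dependency looks very reliable on its own.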
Hystrix
● Fault-tolerance pattern as a library
● Provides operational insights in real-time
● Automatic load-shedding under pressure
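To make "load-shedding under pressure" concrete, here is a minimal sketch of one ingredient of the pattern: a cap on concurrent calls to a dependency, with a fallback when the cap is reached. This is plain Java written for illustration; it is not Hystrix's API, and the `Bulkhead` class and its methods are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class Bulkhead<T> {
    private final Semaphore permits;
    private final Supplier<T> fallback;

    public Bulkhead(int maxConcurrentCalls, Supplier<T> fallback) {
        this.permits = new Semaphore(maxConcurrentCalls);
        this.fallback = fallback;
    }

    /** Runs the call if a permit is free; otherwise sheds the request to the fallback. */
    public T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            // Dependency is saturated: do not queue up more work, answer with the fallback.
            return fallback.get();
        }
        try {
            return call.get();
        } catch (RuntimeException e) {
            // The call itself failed: also answer with the fallback.
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

Because each dependency gets its own limit, a slow search service can exhaust only its own permits and cannot tie up every thread in the API server.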
Availability protection
[Diagram: Gateway → API (scripts; Search, Ratings, Cust and other client libs) → network boundary → Search, Ratings, Customers, ... services]
If you don’t plan for failure
[Diagram: Gateway → API (scripts; Search, Ratings, Cust and other client libs) → network boundary → Search, Ratings, Customers, ... services]
If you do plan for failure
[Diagram: Gateway → API (scripts; Search, Ratings, Cust and other client libs) → network boundary → Search, Ratings, Customers, ... services]
No search results >> no Netflix
Fallbacks
[Diagram: Gateway → API (scripts; Search, Ratings, Cust and other client libs) → network boundary → Search, Ratings, Customers, ... services]
Return static or stale rating
How to handle errors
return getRatings(id);
How to handle errors
try {
    return getRatings(id);
} catch (Exception ex) {
    // static value
    return null;
}
How to handle errors
try {
    return getRatings(id);
} catch (Exception ex) {
    // TODO: what to return here?
}
Handle errors with fallbacks
● Some options for fallbacks
○ Static value
○ Value from in-memory
○ Value from cache
○ Value from network
○ Throw
○ Code
● Make error-handling explicit
● Applications have to work in the presence of either fallbacks or rethrown exceptions
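The "static or stale rating" fallback could look roughly like this. It is a sketch under assumptions: the `RatingsWithFallback` class, the `DEFAULT_RATING` value and the `remote` function are hypothetical and stand in for the real ratings client library. It combines two of the options above, value from in-memory and static value.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class RatingsWithFallback {
    /** Hypothetical static fallback used when nothing better is known. */
    public static final double DEFAULT_RATING = 3.0;

    private final Map<String, Double> lastKnown = new ConcurrentHashMap<>();
    private final Function<String, Double> remote;

    /** remote: id -> rating; stands in for the network call to the ratings service. */
    public RatingsWithFallback(Function<String, Double> remote) {
        this.remote = remote;
    }

    public double getRatings(String id) {
        try {
            double rating = remote.apply(id);
            lastKnown.put(id, rating); // remember the value for later fallbacks
            return rating;
        } catch (RuntimeException ex) {
            // Explicit error handling: stale in-memory value first, static value last.
            return lastKnown.getOrDefault(id, DEFAULT_RATING);
        }
    }
}
```

The catch block no longer has to guess what to return: the choice of fallback is made explicit in one place, and callers always get a usable rating.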
Availability protection beyond Hystrix
● Throttling
● Retries
● Timeouts
● Canaries
● Regional rollouts
● Traffic shifting
● Outlier detection (and elimination)
● Advanced load balancing
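Two of these, timeouts and retries, are easy to show together. The following is a minimal sketch using only `java.util.concurrent`; the `TimeoutRetry` class and its parameters are hypothetical, and a production version would typically add backoff between attempts and reuse a shared thread pool.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeoutRetry {
    /** Tries the call up to maxAttempts times, each bounded by timeoutMillis; then falls back. */
    public static <T> T callWithTimeoutAndRetry(
            Callable<T> call, long timeoutMillis, int maxAttempts, T fallback) {
        // Daemon threads so a hung call cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(call);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // give up on this attempt, then retry
                } catch (InterruptedException e) {
                    future.cancel(true);
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Bounding every attempt matters as much as retrying: without the timeout, a single hung dependency call would hold its caller for as long as the dependency stays slow.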
3. Abstraction
Abstraction goals
● Shield all device teams from every single mid-tier change … at least for a time. Allows us to move more independently
● Shield all device teams from every single platform/infrastructure change
● Provide APIs not provided by downstream services
○ Find all movies that...
○ Length of movie
● Implementation flexibility, e.g.,
○ Caching
○ Batch APIs
Abstraction challenges
● Tech debt
● Device teams can have black-box view (“api == cloud”)
● But isn’t the API team the bottleneck?
○ Yes, sometimes. But organizational structure makes this less of a problem than m mid-tier teams dealing with n device teams
● But: separation of concerns
Server-side logic
Reminder: Today’s architecture
[Diagram: Gateway → API (scripts, ~2100 active; client libs A … Z) → network boundary → Netflix micro-services]
Device teams write server-side logic
● Decoupling teams → better velocity
● UI teams are empowered to
○ Change presentation
○ Filter
○ Add users to A/B tests, which then leads to, e.g., a different layout
What if we didn’t have an API?
What if? Implications for device teams
[Diagram: Gateway, scripts, client libs A … Z and Netflix micro-services, separated by network boundaries]
Device teams own client-side applications ... and groovy scripts
What if? Implications for device teams
● Each device team would have to own
○ Orchestration
○ Frequent dependency updates (currently done (attempted) daily)
○ Implement higher level APIs (all movies that…)
○ Fallbacks and other resiliency protection (e.g., timeouts, retries)
● Recent example
○ Library upgrade caused a lot of NPEs -- why?
○ Worked with team to find out why
○ When fixed, no more NPEs, but instead performance degradation
● Should all teams be dealing with this?
What if? Implications for service teams
[Diagram: Gateway, scripts, client libs A … Z and Netflix micro-services, separated by network boundaries]
Service teams own services ... and client libraries
What if? Implications for service teams
● Can only make breaking changes if all device teams who use their service upgrade
● Don’t get resiliency protection (e.g., timeouts, load balancing, retries, fallbacks) unless all device teams who use their service provide it
● Should all teams be dealing with this?
What if? Implications for Netflix
● Lower velocity due to tight coupling between many mid-tier teams and many device teams
OR: THE DOWNSIDE OF CENTRALIZATION
Where are we today?
● Principle: don’t repeat logic
○ It’s better to do it once in API than do it n times for n devices.
● Principle is good, but leads to complexity
What complexity challenges do we have?
Complexity challenges
● Frequent (not always canaried) updates to a critical service in production
● Difficulty of debugging (esp. for groovy script writers)
● Slow server startup times
● Lack of operational insights into script resource consumption
● Difficulty of performance profiling
● Lack of feedback loop
● Decoupled code versioning and transitive dependencies
Where are we going next?
Top priorities
● Move groovy scripts out
● Split up API
New architecture: Edge PaaS
[Diagram: Gateway and EAS → Node apps on NodeQuark (running on Titus) → network boundary → API (client libs A … Z) → network boundary → Netflix micro-services]
Edge Auth Service
● Auth termination
● Centralized place for auth
Edge PaaS:
● Platform for node scripts
● Developer tooling for entire SDLC
● Remote API with optimized data access (Falcor)
Two APIs
Two (or more) APIs
[Diagram: Gateway and EAS → Node apps on NodeQuark (running on Titus) → network boundary → APIs with DNA clients A … Z, PB clients A … Z and shared clients → network boundary → DNA services A … Z, PB services A … Z and shared services A … Z]
Split API by function
NodeQuark Platform
[Diagram: Zuul and EAS → Node apps on NodeQuark (running on Titus) → network boundary → Java API (client libs A … Z) → network boundary → Netflix micro-services]
Platform for node scripts
Edge PaaS: Node Platform
● Node apps run in containers on Titus platform
● Node Platform provides
○ Integration into Netflix ecosystem (e.g., discovery)
○ Logging
○ Dashboards, metrics out of the box with option to customize
○ Support for mocking and testing
● Titus provides
○ Scheduling
○ Autoscaling
Developer experience
New architecture: Edge PaaS
[Diagram: Gateway and EAS → Node apps on NodeQuark (running on Titus) → network boundary → Java API (client libs A … Z) → network boundary → Netflix micro-services]
Developer tooling for entire SDLC
Edge PaaS: Developer tooling
● Command line tool for node apps
○ Setup
○ Starting apps
○ Deploying apps
● Local development and debugging of node apps
● UI for lifecycle management, e.g., version management
● One-click rollouts, one-click rollbacks
● Versioning
Remote API
New architecture: Edge PaaS
[Diagram: Zuul and EAS → Node apps on NodeQuark (running on Titus) → network boundary → API (client libs A … Z) → network boundary → Netflix micro-services]
Remote API with optimized data access
Edge PaaS: Remote API
● API still takes care of
○ Orchestration
○ Resiliency protection
○ Abstraction
● Optimized access with Falcor
○ “RESTful composition” with caching
● Binary transport
● Future: channel support
Greater simplicity
Isolated failures: Scripts don’t affect each other (usually)
API
Temporarily unavailable!
Independent root causing
API
Latency spike after push: 150ms
Average latency: 10ms
Independent autoscaling
API
Independent insights
API
Average latency: 50ms
Average latency: 10ms
Better regression/performance testing
API
Tests not affected by other scripts eating up resources on the same JVM
Conclusion
Complexity and simplicity
● Product has become much more complex
○ Scripts (more scripts, more complex scripts)
○ Features
○ Number of downstream services to integrate
○ More personalization
○ etc.
● Complexity of API service is high → Need to optimize for simplicity now
○ Process isolation
○ Cleaner developer experience
END