Microservices at Mercari Current status and challenges
Taichi Nakashima (@deeeet/@tcnksm)
SRE at Mercari, automation obsessed, gopher
SRE mission at Mercari
● To ensure a reliable service that is enjoyable to use at any time
● Takes care of all engineering apart from new service development
○ Performance improvement, automation, security, etc.
Current Mercari architecture
[Diagram: nginx; HTTP; API × 3; MySQL × 2; solr × 3; Cache]
Simple 3-tier + α architecture
Single code base
Current Mercari architecture
Same architecture in 3 regions: JP, US, UK
Positive
● A central ops team (SRE) can efficiently handle
Challenges
Monolith?
● Code is too huge/complex to understand
● Team is too large to work efficiently on a shared code base
● Communication overhead is too large
● Velocity (development cycle) is stalled...
Microservices
Microservices?
● Architectural and organizational approach to software development
○ To speed up deployment cycles
○ Foster innovation and ownership
○ Improve maintainability and scalability
Microservices?
$ cat inside.txt | cut -f 1 -d ' ' | sort | uniq -c | sort -nr
Microservices
● Do one thing well
○ Unix philosophy
○ One function in one service, not multiple functions in one service
● Decentralized governance
○ Each team has ownership of its service
● Independent
○ Each service can be changed, upgraded, or replaced independently
● Polyglot
○ Right framework and tool for each domain
Goal
● Software Engineer
○ Keep velocity from stalling; make the feature improvement iteration faster
○ -> Provide great features to customers faster
● SRE
○ Provide an automated platform for microservices
○ Hand some responsibilities (e.g., deployment, debugging) to software engineers
○ -> Focus on more SRE-related software engineering tasks
Team
@deeeet @spensnova @babarot
State of microservices in US
Microservices architecture in US
[Diagram, built up over five slides: (1) Mercari API served over HTTP; (2) a Gateway API added in front of the Mercari API; (3) an offer service added; (4) a search service added; (5) a personalization service added. Links are labeled HTTP and gRPC; the gRPC label first appears together with the offer service.]
Technical stacks
● Docker
● Kubernetes (Google Container Engine)
● gRPC
Container
● Resource isolation
● Resource limitation
● Fast boot (vs. VM)
Docker
● Easy to build container images
● Easy to distribute via a registry
Why Docker?
● Software engineers control more
○ They can include what they want (e.g., runtime, library)
● Environmental parity
○ What works in local development (or the QA env) is exactly the same in production (easy to debug)
○ No more “it works on my environment but not in production!”
● Easy to deploy
○ Docker image ≒ single statically linked binary
○ You already know the benefit if you use Go
Kubernetes (GKE)
● Container orchestration
● Derives from Google’s internal systems named Borg & Omega
● Inspired and informed by Google’s experiences and internal systems
Why kubernetes?
● Best way to maximize container benefits
○ Resource isolation/limitation lets us improve compute resource utilization. But how?
■ K8s schedules containers onto appropriate instances
○ How do dynamically scheduled containers communicate?
■ K8s provides service discovery
● Reduce operation costs
○ Self healing & auto scaling
● Infrastructure of infrastructure
○ Industry standard https://githubengineering.com/kubernetes-at-github
○ More tools/software will be built on top of k8s in the future (I guess)
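To make the "But how?" question above concrete, here is a minimal sketch of resource-aware placement. It is a plain first-fit strategy, not the actual Kubernetes scheduler, and the `Node`, `Pod`, and `schedule` names are hypothetical, introduced only for illustration.

```go
package main

import "fmt"

// Node is a machine with free CPU (millicores) and free memory (MiB).
// Hypothetical type for this sketch only.
type Node struct {
	Name    string
	FreeCPU int
	FreeMem int
}

// Pod is a container workload with declared resource requests.
type Pod struct {
	Name string
	CPU  int
	Mem  int
}

// schedule places the pod on the first node with enough free resources
// and reserves them there. It returns the chosen node name, or false
// when no node can fit the pod.
func schedule(nodes []Node, p Pod) (string, bool) {
	for i := range nodes {
		n := &nodes[i]
		if n.FreeCPU >= p.CPU && n.FreeMem >= p.Mem {
			n.FreeCPU -= p.CPU
			n.FreeMem -= p.Mem
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "node-a", FreeCPU: 500, FreeMem: 512},
		{Name: "node-b", FreeCPU: 2000, FreeMem: 4096},
	}
	for _, p := range []Pod{
		{Name: "offer", CPU: 1000, Mem: 1024},
		{Name: "search", CPU: 250, Mem: 256},
	} {
		name, ok := schedule(nodes, p)
		fmt.Println(p.Name, "->", name, ok)
	}
}
```

Because each container declares its limits, a scheduler can pack workloads onto instances without overcommitting them, which is where the utilization gain comes from.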
gRPC
● gRPC Remote Procedure Call
● High performance, general purpose, open source, standards-based RPC framework
● Open source version of Stubby, the RPC system used in Google
gRPC
● Simple service definition
○ By default, gRPC uses protocol buffers as the Interface Definition Language (IDL) for describing both the service interface and the structure of the payload messages.
● Works across languages and platforms
○ Write a Go server and a Python client
○ Utilize polyglot microservices
Why not REST?
● Who can implement REST correctly?
○ High cost to design (Path? Parameters? hah?)
○ Eventually it’s just HTTP endpoints
● No more HTTP client implementation ...
Challenges
● Deployment
● Observability
Deployment
● Deployment is key in a microservices platform
○ “Without velocity stalled, rather make iteration speed faster”
● We need an easy & safe automated deployment system
○ We started with chatbot-style deployment but it did not scale
Spinnaker
● Continuous Delivery platform
● Developed at Netflix
○ Worked with Google and open sourced in 2015
● Supports multiple clouds
○ Kubernetes!, GCE, AWS
Spinnaker GUI
Spinnaker pipeline
Why Spinnaker?
● Kubernetes support
● Built-in deployment best practices from Netflix and Google
○ Immutable infrastructure
○ Blue/Green deployment, Canary deployment
○ Manual judgement (by manager) phase
○ Run integration tests
Spinnaker in Mercari
● Currently only for container deployment to Kubernetes
● Each team uses Spinnaker to deploy their own services
● One Spinnaker handles all microservices in all regions
Example pipeline of API gateway deployment (Canary)
One Spinnaker cluster manages Mercari global deployment: JP, US, UK
Future of spinnaker
● Pipeline as Code
○ https://github.com/spinnaker/dcd-spec
● Automated canary analysis
Automated canary analysis
https://blog.spinnaker.io/can-i-push-that-building-safer-low-risk-deployments-with-spinnaker-a27290847ac4
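As a rough illustration of what an automated canary judgement could look like, the sketch below compares the canary's error rate with the baseline's. This is not Spinnaker's algorithm; the `Stats` type, the `canaryPasses` function, and the threshold values are all assumptions made for this example.

```go
package main

import "fmt"

// Stats holds request counts for one deployment during the canary window.
// Hypothetical type for this sketch only.
type Stats struct {
	Requests int
	Errors   int
}

// ErrorRate returns errors divided by requests, or 0 when there were no requests.
func (s Stats) ErrorRate() float64 {
	if s.Requests == 0 {
		return 0
	}
	return float64(s.Errors) / float64(s.Requests)
}

// canaryPasses reports whether the canary's error rate is at most the
// baseline's error rate plus tolerance (an absolute difference, so 0.01
// means one percentage point). A canary that served fewer than
// minRequests fails, because there is too little data to judge.
func canaryPasses(baseline, canary Stats, tolerance float64, minRequests int) bool {
	if canary.Requests < minRequests {
		return false
	}
	return canary.ErrorRate() <= baseline.ErrorRate()+tolerance
}

func main() {
	baseline := Stats{Requests: 10000, Errors: 50}
	canary := Stats{Requests: 1000, Errors: 30}
	fmt.Println("promote canary:", canaryPasses(baseline, canary, 0.01, 500))
}
```

A real analysis would look at more signals (latency, saturation) and use statistical tests, but the shape is the same: a pipeline stage that turns metrics into a pass/fail decision instead of a manual judgement.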
Observability
Observability (logging, metrics & tracing) is important
● Each team needs to debug its service by itself, without SSH
● It’s harder and more complex than with a monolith
Stackdriver logging
Request ID in log
● Which service caused a problem in a given request?
Request ID in log
[Diagram: Gateway API, Mercari API, search, personalization, offer; links labeled HTTP and gRPC]
① Generate a unique ID
② Annotate logs with that ID throughout the same request
The ID is passed along as an HTTP header / gRPC metadata
Request ID in log
[Screenshot: searching by request ID shows the log from the gateway and the log from service X]
Distributed tracing
● Which service makes the request slow?
Stackdriver tracing
Metrics
Selection of a metrics service/software is still under discussion & trial
● First-class support for containers and Kubernetes
● Integration with the Kubernetes ecosystem
○ Spinnaker, Istio and so on
● Service dependency visualization
Prometheus + grafana
Datadog
Instana
State of microservices in JP
JP has just started
● Some services (machine learning products) have started to be containerized and deployed on GKE
● On-going discussion about the best architecture
Conclusion
● Why we started microservices
● Current state of US microservices and challenges
We’re hiring
● Those who love automation
● Technical keywords
○ Docker
○ Kubernetes
○ gRPC
○ Golang
○ Container monitoring
Spinnaker is deployed on GKE
Testing
Testing in microservice is hard?
● Focus on unit tests as usual
○ Because each service is supposed to be independent
○ Each microservice must measure test coverage
● Integration tests?
○ Use mocks instead of working hard to prepare a local env
Testing pyramid
[Figure: testing pyramid from Google Testing Blog: Just Say No to More End-to-End Tests, annotated “Do this a lot!” and “Do mock”]
QA environment
How to test a feature under development from a QA device?
● Pull request (PR) based pod creation
PR based pod creation
[Diagram, built up over two slides: a Proxy in front of API gateway (master), API gateway (PR 313), and API gateway (PR 314); the second slide adds Service A (master) and Service A (PR 21). Labels: “Set PR number”, “Proxy by PR number”, “Container is deployed via CI”, “PR based docker container (QA env)”, “Easy to switch”.]
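The "proxy by PR number" idea can be sketched with the standard library reverse proxy. The header name `X-PR-Number` and the backend naming scheme (`api-gateway-pr-313`, `api-gateway-master`) are assumptions made for this example; the slides do not say how the PR number is transmitted or how the pods are named.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strconv"
)

// prHeader is a hypothetical header a QA device sets to select a PR build.
const prHeader = "X-PR-Number"

// backendFor maps a PR number to a backend host name. A missing or
// invalid PR number falls back to the master build.
func backendFor(pr string) string {
	if n, err := strconv.Atoi(pr); err == nil && n > 0 {
		return fmt.Sprintf("api-gateway-pr-%d", n)
	}
	return "api-gateway-master"
}

// newPRProxy returns a reverse proxy whose Director rewrites each
// request's target host according to its PR header.
func newPRProxy() *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			r.URL.Scheme = "http"
			r.URL.Host = backendFor(r.Header.Get(prHeader))
		},
	}
}

func main() {
	// Apply only the routing decision, without sending the request.
	req := &http.Request{Header: http.Header{}, URL: &url.URL{Path: "/items"}}
	req.Header.Set(prHeader, "313")
	newPRProxy().Director(req)
	fmt.Println(req.URL.String())
}
```

Switching the build under test then only requires changing one value on the QA device, with no redeployment of the proxy.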
Future works
Service mesh
Don’t trust each other!
● Traffic management
○ API rate limit, circuit breaker
● Policy enforcement
○ Ensure access policies (which service can access which service?)
We should achieve the above without modifying client/server code!
Service mesh (Istio)
Chaos engineering
● Real world is hard …
○ Machines crash, the network is unstable (especially in a distributed system)
● A dependent service can fail at any time
Chaos engineering
● Services must be fault tolerant whenever something goes wrong
● Emulate real-world problems
○ We need to identify weaknesses
■ Improper fallback settings when a service is unavailable
○ Software engineers should be aware of them
Chaos engineering (Chaos monkey)
https://github.com/Netflix/chaosmonkey