Active/Active Payments Processing at Square Ted Mao and Jiang-Ming Yang October 2014
Jul 27, 2015
Active/Active
What
• Resilient to datacenter-level failure
• Resilient to Internet routing problems
• Transparent to the merchant
• No human intervention
Why
• Every second of uptime matters to our merchants. The goal is five nines (99.999% availability).
• Much easier and safer to perform datacenter-level maintenance.
Challenges
Inconsistent state between datacenters
Datacenters can’t tell if a transaction has already been processed elsewhere.
Limited idempotence
Payment networks can’t reliably guarantee idempotence on retries.
Real-time latency requirements
We can’t just wait until our datacenters get in sync.
Concepts
• Client idempotence key
• Server transaction
• Transaction progression
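As a rough illustration of how these three concepts might fit together (a minimal sketch, not Square's implementation): the client sends the same idempotence key on every retry, the server maps that key to exactly one server transaction, and the transaction only ever moves forward through its progression. The class names, the `charge` method, and the three progression states are all hypothetical; the talk does not list the real ones.

```python
from dataclasses import dataclass, field
from enum import IntEnum
import uuid


class Progress(IntEnum):
    # Hypothetical progression states; the slides do not enumerate them.
    RECEIVED = 0
    AUTHORIZED = 1
    CAPTURED = 2


@dataclass
class ServerTransaction:
    idempotence_key: str
    amount_cents: int
    progress: Progress = Progress.RECEIVED
    id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def advance(self, target: Progress) -> None:
        # Progression is monotonic: a stale or replayed update
        # can never move the transaction backwards.
        self.progress = max(self.progress, target)


class PaymentServer:
    def __init__(self) -> None:
        # Maps each client idempotence key to its single server transaction.
        self._by_key: dict[str, ServerTransaction] = {}

    def charge(self, idempotence_key: str, amount_cents: int) -> ServerTransaction:
        # A retry carrying a known key returns the existing transaction
        # instead of creating (and charging) a second one.
        txn = self._by_key.get(idempotence_key)
        if txn is None:
            txn = ServerTransaction(idempotence_key, amount_cents)
            self._by_key[idempotence_key] = txn
        return txn
```

In this sketch a client that times out and retries `charge("k1", 500)` gets the same `ServerTransaction` back, so the retry is safe within one server. Making that lookup work across datacenters is the hard part the talk addresses.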
Card Processing Multi-DC resolution
Multi-Tender Multi-DC challenge
Scenario
When a merchant sells items to a customer, the customer has the option to pay with multiple tenders.
APIs
1. CreateBill
2. AddTender
3. CompleteBill / CancelBill
Challenges
1. Each tender request must be processed as soon as it is received, so different tenders for the same bill may be processed at different datacenters.
2. When a CompleteBill request arrives, we may need to wait for tender information from a remote datacenter.
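The two challenges above can be made concrete with a small sketch. It assumes, purely for illustration, that the CompleteBill request names the tenders the client expects, and that datacenters exchange tender records through some replication step. The `Datacenter` class, its method names, and the `PENDING`/`COMPLETED` results are hypothetical and are not taken from Square's API.

```python
from dataclasses import dataclass, field


@dataclass
class Datacenter:
    name: str
    # bill_id -> {tender_id: amount_cents}: the tenders this datacenter knows about.
    tenders: dict = field(default_factory=dict)

    def create_bill(self, bill_id: str) -> None:
        self.tenders.setdefault(bill_id, {})

    def add_tender(self, bill_id: str, tender_id: str, amount_cents: int) -> None:
        # Each tender is processed immediately at whichever datacenter receives it.
        self.tenders.setdefault(bill_id, {})[tender_id] = amount_cents

    def receive_replicated(self, other: "Datacenter") -> None:
        # Stand-in for asynchronous cross-datacenter replication (an assumption).
        for bill_id, remote in other.tenders.items():
            self.tenders.setdefault(bill_id, {}).update(remote)

    def complete_bill(self, bill_id: str, expected_tender_ids: list):
        known = self.tenders.get(bill_id, {})
        missing = set(expected_tender_ids) - set(known)
        if missing:
            # Some tenders were processed elsewhere and have not arrived yet.
            return ("PENDING", sorted(missing))
        return ("COMPLETED", sum(known[t] for t in expected_tender_ids))


dc_a, dc_b = Datacenter("A"), Datacenter("B")
dc_a.create_bill("bill-1")
dc_a.add_tender("bill-1", "t1", 1000)   # first tender lands in datacenter A
dc_b.add_tender("bill-1", "t2", 500)    # second tender lands in datacenter B

first = dc_a.complete_bill("bill-1", ["t1", "t2"])
dc_a.receive_replicated(dc_b)
second = dc_a.complete_bill("bill-1", ["t1", "t2"])
```

Here `first` is pending because datacenter A has not yet seen `t2`, and `second` completes once the remote tender has been replicated.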
Multi-Tender Multi-DC resolution
State Machine
Tender state machine
Bill state machine
Correctness
1. A formal proof
2. Simulating all possible combinations of operations and verifying the results
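The simulation approach might look something like the following sketch, which applies a fixed set of bill operations in every possible order and checks that all orders end in the same state. The operations, the state representation, and the "cancel wins over complete" rule are assumptions made up for this example; the talk does not describe the actual state machines or simulator.

```python
from itertools import permutations

# Hypothetical operations on a single bill.
OPS = [
    ("add_tender", "t1"),
    ("add_tender", "t2"),
    ("complete", None),
    ("cancel", None),
]


def apply(state, op):
    """Apply one operation to an immutable (tenders, completed, cancelled) state."""
    tenders, completed, cancelled = state
    kind, arg = op
    if kind == "add_tender":
        tenders = tenders | {arg}
    elif kind == "complete":
        completed = True
    elif kind == "cancel":
        cancelled = True
    return (tenders, completed, cancelled)


def status(state):
    # Assumed conflict rule: a cancel always wins over a complete.
    _, completed, cancelled = state
    if cancelled:
        return "CANCELLED"
    return "COMPLETED" if completed else "OPEN"


def simulate(ops):
    state = (frozenset(), False, False)
    for op in ops:
        state = apply(state, op)
    return state


def final_states(ops):
    # Run every ordering of the operations and collect the distinct outcomes.
    return {simulate(order) for order in permutations(ops)}


outcomes = final_states(OPS)
assert len(outcomes) == 1, "orderings diverged"
```

With four operations this checks all 24 orderings. If `apply` contained an order-dependent rule, the set would hold more than one outcome and the assertion would fail.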
Caveats
Eventually consistent
Asynchronous, eventually consistent systems are harder to reason about.
Complex
Active/active systems are harder to design, implement, and test.
Data Loss
If the original datacenter goes down and never comes back, we may not be able to perform the capture because the original auth is lost.
Downstream effects
Not all downstream effects are reversible.
Future Plans
We want a storage solution with the following properties:
1. Horizontally scalable
2. Tolerant to DC failure
3. Transactional
CockroachDB: a Scalable, Geo-Replicated, Transactional Datastore
http://cockroachdb.org/
Q&A