How Twitter built a framework to improve infrastructure resource utilization at scale
Jan 23, 2018
Hello!
I am @VinuCharanya
★ Sr. Systems Engineer @Twitter
★ Proud to be a member of @TwitterWomen, @Techwomen and @WomenWhoCode
Agenda
1. History & Context
2. Chargeback @Twitter
3. Kite - Service Lifecycle Manager
4. Impact & Future Work
History & Context
Thousands of MicroServices
INFRASTRUCTURE & DATACENTER MANAGEMENT
• CORE APPLICATION SERVICES: TWEETS, USERS, SOCIAL GRAPH
• PLATFORM SERVICES: SEARCH, MESSAGING & QUEUES, CACHE, MONITORING AND ALERTING, INGRESS & PROXY
• FRAMEWORK/LIBRARIES: FINAGLE (RPC), SCALDING (Map Reduce in Scala), HERON (Streaming Compute), JVM
• MANAGEMENT TOOLS: SELF SERVE, SERVICE DIRECTORY, CHARGEBACK, CONFIG MGMT
• DATA & ANALYTICS PLATFORM: INTERACTIVE QUERY, DATA DISCOVERY, WORKFLOW MANAGEMENT
• INFRASTRUCTURE SERVICES
  • STORAGE: MANHATTAN, BLOBSTORE, GRAPHSTORE, TIMESERIESDB
  • COMPUTE: MESOS/AURORA, HADOOP
  • DB/DW: MYSQL, VERTICA, POSTGRES
• DEPLOY (Workflows)
[Chart: Number of Servers for MESOS/AURORA, HADOOP, MANHATTAN; callout: 67%]
• How to get visibility into resources used by individual jobs & datasets?
• How to attribute resource consumption to teams/organization?
• How do you incentivize the right behavior to improve efficiency of resource usage?
Chargeback @Twitter
Ability to meter allocation & utilization of resources per service, per project, per engineering team to improve visibility & enable accountability
Features
• Supports diverse Infra/Platform Services
• Meters abstract resources at daily granularity
• Detailed Reports
Support diverse Infrastructure and Platform Services
1. Resource Catalog: Consistent way to inventory infrastructure resources
• Resource Fluidity: Support primitive (CPU) and abstract ("Tweets / second") resources. Extend existing resources
2. Resource <> Client Identifier Ownership: Map of client identifier to an owner to enable accountability
RESOURCE CATALOG ENTITY MODEL
PROVIDER -1:N-> INFRASTRUCTURE SERVICE -1:N-> OFFERINGS -1:N-> OFFER MEASURES -1:N-> OFFER MEASURE COST
Example: TWITTER DC/PUBLIC CLOUD -> COMPUTE -> CORE-DAYS -> $X
Example: TWITTER DC -> STORAGE, PROCESSING CLUSTER, … -> GB-RAM, FILE ACCESSES, … -> $X, $Y, … $M, $N, …
/api/1/measures
{ "measures": [
  {"measure_id": 1, "measure_label": "core-days", "measure_unit_label": "per 1 core-day", "offering_id": 1, "offering_label": "Compute", "infrastructure_id": 1, "infrastructure_name": "Aurora"},
  {"measure_id": 2, "measure_label": "machine-days", "measure_unit_label": "per 1 machine-day", "offering_id": 2, "offering_label": "zone:tweety", "infrastructure_id": 8, "infrastructure_name": "Physical Infrastructure"},
  …
] }
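To make the catalog shape concrete, here is a minimal Python sketch that groups measure labels by infrastructure and offering. It assumes a payload shaped like the two /api/1/measures records shown on the slide; the function name `index_measures` is hypothetical and not part of the actual service.

```python
import json
from collections import defaultdict

# Sample payload copied from the two records visible on the slide.
PAYLOAD = """
{"measures": [
  {"measure_id": 1, "measure_label": "core-days",
   "measure_unit_label": "per 1 core-day",
   "offering_id": 1, "offering_label": "Compute",
   "infrastructure_id": 1, "infrastructure_name": "Aurora"},
  {"measure_id": 2, "measure_label": "machine-days",
   "measure_unit_label": "per 1 machine-day",
   "offering_id": 2, "offering_label": "zone:tweety",
   "infrastructure_id": 8, "infrastructure_name": "Physical Infrastructure"}
]}
"""

def index_measures(payload: str) -> dict:
    """Group measure labels by (infrastructure name, offering label)."""
    catalog = defaultdict(list)
    for measure in json.loads(payload)["measures"]:
        key = (measure["infrastructure_name"], measure["offering_label"])
        catalog[key].append(measure["measure_label"])
    return dict(catalog)

catalog = index_measures(PAYLOAD)
print(catalog[("Aurora", "Compute")])
```

Keying on the (infrastructure, offering) pair mirrors the 1:N chain of the entity model: one infrastructure service can expose several offerings, each with several measures.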
So, how do you incentivize the right behavior to improve efficiency of resource usage?
Pricing is one way…
Total Cost of Ownership for Aurora: $X per core-day
Inputs: Cost of Physical Server ($X / day), Total available Cores
Total available Cores break down into:
• Total used Cores: Production Used Cores, Non-Prod Used Cores
• Excess Cores (incl. DR, Spikes, Overallocation): Headroom, Quota Buffer (Underutilized Quota), Container Size Buffer (Underutilized Reservation)
• Cores used by platform for operations & maintenance: Operational Overhead
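One way to read the Total Cost of Ownership stack is that the price of a core-day has to recover the cost of every server, including cores that sit in headroom, buffers, and platform overhead. The Python sketch below illustrates that reading. The formula and all numbers are assumptions made for illustration; the slides do not give the exact computation.

```python
def unit_price_per_core_day(server_cost_per_day: float,
                            servers: int,
                            used_cores: float) -> float:
    """Spread the full daily fleet cost over the cores tenants actually use.

    Assumption: excess cores and platform overhead are not billed
    separately, so their cost is folded into the per-core-day price.
    """
    if used_cores <= 0:
        raise ValueError("used_cores must be positive")
    return server_cost_per_day * servers / used_cores

# Hypothetical fleet: 100 servers at $10/day with 32 cores each
# (3,200 available cores), of which 2,000 are used by prod and non-prod jobs.
price = unit_price_per_core_day(server_cost_per_day=10.0,
                                servers=100,
                                used_cores=2000)
print(f"${price:.2f} per core-day")  # $0.50 per core-day
```

Dividing by available cores instead would give about $0.31 per core-day with these made-up numbers and leave the excess capacity unfunded, which is why the distinction between used and available cores matters for pricing.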
Metering Pipeline (ETL Job)
INFRASTRUCTURE SERVICE 1, INFRASTRUCTURE SERVICE 2 -> INGEST METRICS -> RAW FACT -> TRANSFORMER -> RESOLVED FACT -> DATA FIDELITY -> REPORT
Reference data: RESOURCE CATALOG, IDENTIFIER OWNERSHIP MAPPING
Metrics Ingestor
• Schema(client_identifier, offering_measure, volume, metadata, timestamp)
Transformer
1. Resolve Ownership
2. Cost Computation
Data Fidelity & Reporting
1. Verify Data Integrity & Fidelity
2. Alert when things don't look the way they should
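The transformer stage of the metering pipeline (resolve ownership, then compute cost) can be sketched in a few lines of Python. The record fields follow the ingest schema from the slides; the lookup tables, the unit price, and the names `RawFact`, `ResolvedFact`, `transform`, and `unowned` are hypothetical stand-ins, not the production implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class RawFact:
    # Fields follow Schema(client_identifier, offering_measure,
    # volume, metadata, timestamp).
    client_identifier: str
    offering_measure: str
    volume: float
    metadata: dict
    timestamp: str  # daily granularity, e.g. "2018-01-23"

@dataclass(frozen=True)
class ResolvedFact:
    raw: RawFact
    owner: Optional[str]  # None when no owner is mapped
    cost: float

# Hypothetical reference data standing in for the identifier ownership
# mapping and the resource catalog.
OWNERSHIP = {"tweetypie.prod.tweetypie": "tweetypie"}
UNIT_PRICE = {"core-days": 0.50}  # made-up price per unit

def transform(fact: RawFact) -> ResolvedFact:
    owner = OWNERSHIP.get(fact.client_identifier)            # 1. resolve ownership
    cost = fact.volume * UNIT_PRICE[fact.offering_measure]   # 2. cost computation
    return ResolvedFact(raw=fact, owner=owner, cost=cost)

def unowned(facts):
    """Simple fidelity check: resolved facts that nobody owns."""
    return [fact for fact in facts if fact.owner is None]

raw = [
    RawFact("tweetypie.prod.tweetypie", "core-days", 120.0, {}, "2018-01-23"),
    RawFact("unknown.prod.job", "core-days", 8.0, {}, "2018-01-23"),
]
resolved = [transform(fact) for fact in raw]
print(resolved[0].owner, resolved[0].cost)
print(len(unowned(resolved)), "fact(s) without an owner")
```

Keeping unowned facts in the output rather than dropping them lets the fidelity stage report them, so a missing ownership mapping shows up as an alert instead of as silently missing spend.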
Reports
Customers
• Infrastructure & Platform Operators: Overall Cluster Growth; Allocation vs. Utilization of resources by Client/Tenant
• Finance & Execs: Budget vs. Spend per Org; Infrastructure PnL; Overall Efficiency & Trends
• Service Owners & Developers: Team Bill; Per Service Allocation vs. Utilization of Resources
INFRASTRUCTURE PNL
CHARGEBACK BILL FOR A TEAM
CHARGEBACK DRILLDOWN FOR A TEAM
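The allocation vs. utilization view that appears in the operator and service-owner reports boils down to a ratio per service. A minimal Python sketch, assuming allocation and usage are metered in the same unit (for example core-days); the service names and numbers are made up:

```python
def utilization_report(rows):
    """rows: iterable of (service, allocated, used) in the same unit."""
    report = {}
    for service, allocated, used in rows:
        ratio = used / allocated if allocated else 0.0
        report[service] = {
            "allocated": allocated,
            "used": used,
            "utilization": round(ratio, 2),
        }
    return report

# Hypothetical services and volumes.
rows = [("service-a", 200.0, 50.0), ("service-b", 80.0, 72.0)]
report = utilization_report(rows)

# List the least utilized services first, since they have the most to give back.
for service, line in sorted(report.items(), key=lambda kv: kv[1]["utilization"]):
    print(service, line)
```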
Learnings
1. Invest in data fidelity
• Trust in data is most important.
• Invest in monitoring & alerting for data inconsistencies
• Leverage this to detect abnormal increases/decreases and notify users
2. Accurate ownership mapping
• Static mappings go out of date quickly
• Invest in systems (e.g., Kite) for users to manage it themselves
3. Logical grouping of resources
• Identifiers were too granular and teams were too broad.
• Find a good middle ground and invest in a system (e.g., Kite) to track, understand and maintain it
4. Track historical data
• Unit prices change over time
• Orgs / Teams change over time
• Resources get added / removed
• Change history is essential for consistency, which capacity planning relies on
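The learning about detecting abnormal increases or decreases can be illustrated with a simple day-over-day check on metered volume. This Python sketch is an assumption about one possible approach, not the monitoring actually used; the 50% threshold and the sample series are arbitrary.

```python
def abnormal_changes(daily_volume, threshold=0.5):
    """Return (day, previous, current) where the day-over-day change
    exceeds the threshold.

    daily_volume: list of (day, volume) sorted by day.
    threshold: relative change; 0.5 means +/-50% (an arbitrary default).
    """
    flagged = []
    for (_, prev), (day, cur) in zip(daily_volume, daily_volume[1:]):
        if prev == 0:
            changed = cur != 0  # any usage after a zero day counts as a change
        else:
            changed = abs(cur - prev) / prev > threshold
        if changed:
            flagged.append((day, prev, cur))
    return flagged

# Hypothetical core-days metered for one client identifier.
series = [("2018-01-20", 100.0), ("2018-01-21", 104.0),
          ("2018-01-22", 30.0), ("2018-01-23", 31.0)]
alerts = abnormal_changes(series)
print(alerts)  # [('2018-01-22', 104.0, 30.0)]
```

A flagged drop like this one could mean either a real release of resources or a metering gap, which is why the same signal is useful both for notifying users and for checking data fidelity.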
Kite @Twitter
• DASHBOARD (SINGLE PANE OF GLASS)
• SERVICE IDENTITY MANAGER
• RESOURCE PROVISIONING MANAGER
• REPORTING
• SERVICE LIFECYCLE WORKFLOWS
• METADATA, CLIENT IDENTITY, RESOURCE QUOTA MANAGEMENT, METERING & CHARGEBACK
• PROVIDER APIS & ADAPTERS
• INFRASTRUCTURE & PLATFORM SERVICES
10,000+ Client Identifiers, 1,000+ Projects, 100+ Teams, 8 Infrastructure Services
Client Identifier Management
Identity System: Built a consistent way to group client identifiers of different infrastructure services into a project and enabled ownership
• Capture Org Structure: Support org structure changes, project transfer workflows to ensure up-to-date ownership of identifiers
• Unify client identifier provisioning workflow: Enables single source of truth and reduces operator pain around provisioning and managing client identifiers.
IDENTITY ENTITY MODEL
BUSINESS OWNER -1:N-> TEAM -1:N-> PROJECT -1:N-> SERVICE/SYSTEM ACCOUNT -1:N-> <INFRA, CLIENTID>
Example: INFRASTRUCTURE -> TWEETYPIE -> tweetypie -> tweetypie -> <Aurora, tweetypie.prod.tweetypie>
Example: REVENUE -> ADS PREDICTION -> prediction -> ads-prediction -> <Aurora, ads-prediction.prod.campaign-x>
Entities are time-varying dimensions
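Treating identity entities as time-varying dimensions means an ownership lookup needs a date as well as an identifier, so that a past bill is attributed to whoever owned the resource at the time. A minimal Python sketch of an effective-dated lookup follows; the class name `OwnershipHistory`, the team names, and the dates are hypothetical.

```python
from bisect import bisect_right

class OwnershipHistory:
    """Effective-dated owner per client identifier (illustrative only)."""

    def __init__(self):
        # client_id -> list of (effective_from, team), kept sorted by date
        self._history = {}

    def assign(self, client_id, effective_from, team):
        entries = self._history.setdefault(client_id, [])
        entries.append((effective_from, team))
        entries.sort()  # ISO dates sort correctly as strings

    def owner_on(self, client_id, day):
        entries = self._history.get(client_id, [])
        dates = [effective_from for effective_from, _ in entries]
        i = bisect_right(dates, day)  # number of entries effective on or before day
        return entries[i - 1][1] if i else None

history = OwnershipHistory()
client = ("Aurora", "tweetypie.prod.tweetypie")
history.assign(client, "2017-01-01", "team-a")  # hypothetical teams and dates
history.assign(client, "2018-01-01", "team-b")

print(history.owner_on(client, "2017-06-15"))  # team-a
print(history.owner_on(client, "2018-01-23"))  # team-b
print(history.owner_on(client, "2016-12-31"))  # None
```

The same pattern applies to unit prices and org structure: keeping each change with its effective date lets historical reports be recomputed consistently.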
Impact
10,000+ Client Identifiers
CLAIM OWNERSHIP
PROJECT DISCOVERY
TEAM OVERVIEW
• Released unused Resources
• Q2 unit price update
• New project launch
PROJECT METADATA
AURORA QUOTA MANAGER
Future Work
1. Resource provisioning
• Extend Quota Manager and unify the experience into Kite
• Onboard Hadoop, Storage and other systems
2. Enable project deprecation
• Detect unused resources, notify users, trigger deprecation process based on policy
3. Capacity Planning
• Provide historic trends and help with forecast of capacity
@VinuCharanya