R3 Kickoff Meeting: Ocean Observatories Initiative, Common Execution Infrastructure (CEI) Subsystem (OOI CI System Architecture Team)
Post on 14-Dec-2015
CEI Developers
• CEI Developer: John Bresnahan, Argonne National Lab (part-time)
• CEI Developer: Patrick Armstrong, University of Chicago
• CEI Developer: Pierre Riteau, University of Chicago (part-time)
• CEI Senior Developer: Pierre Riteau, University of Chicago
Subsystem Purpose
• Allow OOI applications and system to
  – Provide Highly Available (HA) services
  – Scale to demand
• Enact OOI deployment policies in elastic environment
• Provide a deployment foundation for OOI CI
Core System Structure: Service Layers
CEI Scope
• Elastic Computing Services
  – Implement elastic computing services to provide on-demand scaling and high availability.
• Execution Engine Catalog & Repository Services
  – Work with operations and ITV to develop and refine tools to upload and sync the different deployable type representations adapted to each site.
• Process Management Services
  – Provide the management services for policy-based process execution within specified deployable types intended to support the data distribution services; as such, the processes are sequential and primarily require a process-to-resource match.
• Process Catalog & Repository Services
  – Maintain process definitions as well as lists of active processes.
• Integration with the National Computing Infrastructure
  – Provide the capability to deploy OOI processing on Amazon cloud services as well as academic clouds.
High Availability and Scaling
• High Availability
  – Towards an always-on service model
  – Failures in outsourced resources
  – Providing a pool of replenishable compute resources
• Autoscaling
  – Provide resources for peaks in demand
  – Ensure good utilization during “valleys” in demand
  – Flexible resource mix
Resources for HA and Scaling
EPU Management: monitor and regulate set properties based on system-specific and application-specific metrics
– Cloud resources are available on demand, but any particular resource may fail at any time
– Applications/processes can absorb new resources
– Applications/processes can tolerate failures
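The "monitor and regulate" idea can be illustrated with a minimal sketch of one regulation step. Everything here is an assumption made for illustration: the `EpuPolicy` fields, the queue-length metric, and the function names are hypothetical and are not taken from the CEI code base.

```python
from dataclasses import dataclass


@dataclass
class EpuPolicy:
    """Hypothetical policy knobs; the names are illustrative, not from CEI."""
    min_instances: int = 1
    max_instances: int = 10
    target_queue_per_instance: int = 5


def desired_instances(queue_length: int, policy: EpuPolicy) -> int:
    """Return how many healthy instances the EPU should hold for this load."""
    # Ceiling division: each instance should handle at most the target backlog.
    needed = -(-queue_length // policy.target_queue_per_instance)
    # Clamp to the policy bounds so the pool never drops to zero or grows unbounded.
    return max(policy.min_instances, min(policy.max_instances, needed))


def reconcile(healthy: int, failed: int, queue_length: int, policy: EpuPolicy) -> dict:
    """Compare the observed state with the desired state and propose actions."""
    target = desired_instances(queue_length, policy)
    delta = target - healthy
    return {
        "terminate_failed": failed,  # failed instances are cleaned up
        "launch": max(delta, 0),     # replaces failures and absorbs peaks
        "retire": max(-delta, 0),    # releases capacity during valleys
    }


if __name__ == "__main__":
    # Two healthy instances, one failed, and a backlog of 23 queued items.
    print(reconcile(healthy=2, failed=1, queue_length=23, policy=EpuPolicy()))
```

The sketch relies on the two properties listed above: because applications can absorb new resources and tolerate failures, the regulator only has to compare a target count with the healthy count, and it does not need to know which instance failed or why.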
Managing Resources
Elastic Processing Unit (EPU) Management
[Diagram. Labels: EPU Management (shown three times); EE ioncore 1.3, EE ioncore 1.2, EE matlab 6.1; context-agent and ou-agent (shown three times each); Decision Engine; Provisioner; DTRS; IaaS ("create instance"); CB. Legend: AMQP, Other.]
Making the EPU HA
[Diagram. Labels: ou-agent (×3); EPU Worker (×9); Bootstrap EPU; Dedicated DE; Provisioner/DTRS; IaaS ("create instance"); cloudinit.d. Legend: AMQP, Other.]
Managing Processes
Creating a Process I
[Diagram. Labels: Process Definition Registry; Process Dispatcher; Process Instance Registry; EE type A instance; ee-agent; Decision Engine. Arrows: request to activate process X; lookup; launch; enter. Legend: AMQP, Other.]
Creating a Process II
[Diagram. Labels: Process Definition Registry; Process Dispatcher; Process Instance Registry; Provisioner/DTRS; IaaS; EPU Management; EE type A instance; ee-agent; Decision Engine. Arrows: request to activate process X; lookup; launch; enter; request instance; create instance. Legend: AMQP, Other.]
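The arrow labels in the two "Creating a Process" diagrams (lookup, launch, enter, request instance) suggest an activation flow that can be sketched roughly as follows. This is one reading of the diagram labels under stated assumptions, not the CEI implementation: the class names, the slot-based capacity model, and the `RUNNING`/`WAITING` states are all hypothetical, and the registries are replaced by in-memory dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionEngine:
    """Hypothetical stand-in for an EE instance with a fixed number of slots."""
    ee_type: str
    slots: int
    processes: list = field(default_factory=list)

    def has_capacity(self) -> bool:
        return len(self.processes) < self.slots


class ProcessDispatcher:
    """Illustrative dispatcher; not the CEI Process Dispatcher API."""

    def __init__(self, definitions: dict, engines: list):
        self.definitions = definitions  # stand-in for the Process Definition Registry
        self.engines = engines          # EE instances known to the dispatcher
        self.instances = {}             # stand-in for the Process Instance Registry
        self.instance_requests = []     # requests that would go to EPU Management

    def activate(self, process_name: str) -> str:
        # "lookup": find which EE type the process definition requires.
        ee_type = self.definitions[process_name]
        for engine in self.engines:
            if engine.ee_type == ee_type and engine.has_capacity():
                # "launch": place the process on a matching EE with a free slot.
                engine.processes.append(process_name)
                # "enter": record the active process.
                self.instances[process_name] = "RUNNING"
                return "launched"
        # Case II: no matching EE has room, so "request instance".
        self.instance_requests.append(ee_type)
        self.instances[process_name] = "WAITING"
        return "requested instance"


if __name__ == "__main__":
    dispatcher = ProcessDispatcher(
        definitions={"X": "A", "Y": "A"},
        engines=[ExecutionEngine(ee_type="A", slots=1)],
    )
    print(dispatcher.activate("X"))  # a free EE of type A exists
    print(dispatcher.activate("Y"))  # the only EE of type A is full
```

In this reading, "Creating a Process I" corresponds to the branch where a matching execution engine already has capacity, and "Creating a Process II" to the branch where a new instance has to be requested first.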
Inside an Execution Engine
[Diagram. Labels: EE type A instance; context-agent; ee-agent; ou-agent; supervisord (×3); CC instance (×2); Matlab script; process (adapter) 1; datastream, subscription, result; Process Dispatcher; EPU Management; Package Server. Operation tags: C, CM, CMR, CMK, CMKO. Legend: AMQP, Other.]
C – create, M – monitor, R – restart, K – kill, O – I/O
Adventures in Availability
• Time to repair (TTR)
  – Diagnosis
  – Time to scale (TTS)
    • PENDING (request)
    • STARTED (deployment)
    • RUNNING (contextualization)
$A = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$
where MTBF is the mean time between failures and MTTR is the mean time to repair.
[Figure: TTS, preliminary results for 2,000 VMs provisioned on AWS EC2]
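A short worked example may help show why time to repair matters for availability. The function below computes availability as MTBF divided by the sum of MTBF and MTTR; the input figures are made up for illustration and are not measurements from the deck.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical figures: a failure every 99 hours on average.
    slow_repair = availability(mtbf_hours=99.0, mttr_hours=1.0)  # one hour to repair
    fast_repair = availability(mtbf_hours=99.0, mttr_hours=0.1)  # six minutes to repair
    print(f"slow repair: {slow_repair:.4f}")
    print(f"fast repair: {fast_repair:.4f}")
```

With the failure rate held fixed, shortening the repair time is the only way to raise A, which is why the time to scale (how quickly a replacement resource reaches the running state) appears as a component of the time to repair.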
R3 Scope
• Process management
  – Activation and validation
  – New execution site registration
• Integration with National Infrastructure
  – Framework for integration of academic cloud providers, TeraGrid and OSG
  – Integration with Microsoft cloud
R3 Activities
• Refine/change scope to achieve a complete and maintainable system
• Decide on specific solutions for R3 scope
Questions?