Document No: SKA-TEL-SDP-0000020
Context: SKA.SDP.COMP
Revision: 1.0
Release Date: 2015-02-09
Document Classification: Unrestricted
Status: Draft

PDR.02.01.02 Compute platform: Software stack developments and considerations

Authors: D. Christie, P. Crosby, A. Ensor, N. Erdody, P. C. Broekema, B. A. do Lago, M. Mahmoud, R. O'Brien, R. Rocha, T. Stevenson, A. St John, J. Taylor, Y. Zhu, A. Mika
This document is intended as supporting material for the Software Stack chapter of the Compute platform sub-element.
ICE
ICE has been suggested for use as the main enterprise bus middleware technology for ASKAP.
Essentially its role in ASKAP computing is to coordinate the telescope's main components.
ZeroMQ
For non-enterprise or less-critical application messaging, a brokerless messaging system might
be used to reduce potential bottlenecks, latency and network communication associated with
enterprise-grade brokered messaging systems. The ZeroMQ (ØMQ, http://zeromq.org/)
lightweight messaging kernel extends the familiar concept of TCP socket interfaces with
features supporting asynchronous message queues and subscription-based message filtering
over a range of network protocols. The core library exposes a C-style API, and bindings are
available for many languages, including C++, C#, Java, and Python.
Unlike TCP sockets, ØMQ sockets can use various transport protocols: besides TCP it supports
pgm (pragmatic general multicast) and its encapsulated variant epgm, ipc (inter-process
communication) and inproc (in-process transport between threads sharing a ØMQ context).
A ØMQ socket allows an arbitrary number of incoming and outgoing connections, and network
connections can come and go dynamically, with ØMQ automatically reconnecting. Similarly to
TCP, one side (nominally the server) binds to a socket and the others (the clients) connect to
it, although ØMQ allows clients to connect before the server has bound the socket. All I/O for
a socket is handled asynchronously by a pool of background ØMQ threads.
ØMQ supports various messaging patterns including request-reply, publish/subscribe, pipeline,
and exclusive pair.
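The request-reply pattern over the inproc transport can be sketched as follows. This is a minimal illustration assuming the pyzmq Python binding is installed; because inproc delivery is asynchronous and in-memory, both ends can even live in a single thread, as here, for brevity.

```python
# Minimal ØMQ request-reply sketch over the inproc transport,
# assuming the pyzmq binding is available (pip install pyzmq).
import zmq

ctx = zmq.Context()

rep = ctx.socket(zmq.REP)
rep.bind("inproc://demo")         # the server side binds ...
req = ctx.socket(zmq.REQ)
req.connect("inproc://demo")      # ... and the client connects

req.send_string("ping")           # sends are queued asynchronously
request = rep.recv_string()
rep.send_string(request.upper())  # reply with the upper-cased request
reply = req.recv_string()
print(reply)                      # PING

for sock in (req, rep):
    sock.close()
ctx.term()
```

Swapping "inproc://demo" for a "tcp://..." endpoint would turn this into a networked exchange without changing the messaging logic, which is the portability ØMQ sockets aim for.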
Cloud software
"Cloud computing" is a generic term that describes the ability to automatically and immediately provision compute capability without the need to invest capital in hardware or software. This is made possible by the development of technologies that virtualise the provisioning of servers, networks and other infrastructure. Similar virtualisation has taken place on a software level such that operating systems, middleware, databases, development tools and end-user tools can all also be automatically and immediately provisioned.
Amazon, Google, Microsoft and many smaller businesses all have their own versions of cloud computing. The market disruption they have caused is due to their ability to charge consumers of cloud capability on a pay-as-you-go basis. This allows consumers to pay only for what they use, which can lead to cost savings and to different models of undertaking compute tasks ranging from tiny to very large. Systems can be architected to scale up and down with temporal usage, as opposed to being fixed to peak usage criteria.
OpenStack
The free and open source community has risen to the challenge of cloud computing by largely coalescing around a suite of inter-operating projects that fall under the umbrella of the OpenStack Foundation. These developments are effectively defining open standards for cloud computing. OpenStack is also creating the opportunity for many new players to enter the field of offering cloud services, once the domain of a handful of large-scale suppliers. It also now makes commercial sense for organisations with high internal compute requirements to build their own "private" clouds. Past efforts have tried projects such as OpenNebula or pure KVM, but lately OpenStack [9] is showing better traction.
The OpenStack project started as a combined NASA and Rackspace development. It gained popularity very quickly and now has many large scientific institutions (e.g., CERN) among its users [10]. The project is supported by hundreds of companies, with contributions made by thousands of developers, making it one of the fastest growing open source projects to date.
OpenStack aims to provide all the services required to run a private cloud while remaining agnostic in terms of the actual virtualization solution being used underneath. It shares with Linux a lot of the points that have made it successful: community effort, led by a foundation (there’s no single entity driving its features); fully open source, allowing easy adoption of new features; large user base with frequent summits for discussions covering operational procedures, best practices and development priorities.
There is considerable interest in using OpenStack in the general HPC community, for the following reasons (see also [RD-01]):
● HPC "as a service" is gaining momentum in cloud environments.
● OpenStack provides a potential common framework for the integration of HPC middleware components such as provisioning, management and control. Basing these elements on a common framework may aid development and provide consistency.
● Elements such as containerisation will allow DevOps on production resources, thus providing test facilities at the scale of production-level operation.
Based on the arguments above, OpenStack appears to be a suitable basis for the SDP software platform, in particular through components such as:
● glance for image management, typically but not necessarily limited to a Debian/Ubuntu image with all the required software installed,
● cinder for block-based data access, the backend of all virtual machine instances. It supports LVM, Ceph, GlusterFS and many other storage solutions, leaving the choice of storage technology open to specific deployments,
● nova for management of computing resources, including scheduling and allocation.
In addition, this solution provides:
● multi-tenant support, including quota management based on groups of users,
● all the required packaging, especially well supported when used in conjunction with the Debian/Ubuntu distribution mentioned above,
● the potential to explore new usages of the computing infrastructure that go beyond what is typically done in scientific domains. Offerings of tools like R or other data analysis services that directly expose their interfaces, with no need to even deal with the details of setting up virtual machines, are extremely promising and should be explored,
● a common platform for engineers across the globe to develop and test software.
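To illustrate the kind of API-driven provisioning these components enable, the sketch below assembles the parameters of a Nova server-create request. The server, image, flavour and network names are hypothetical placeholders, and `server_spec` is a helper invented for this illustration; with the openstacksdk library and a configured cloud, the commented call would submit the request.

```python
# Sketch of API-driven provisioning against Nova (OpenStack compute).
# All names below are hypothetical placeholders, not real SDP resources.
def server_spec(name, image, flavor, network):
    """Assemble the keyword arguments for a compute create-server call."""
    return {
        "name": name,
        "image_id": image,
        "flavor_id": flavor,
        "networks": [{"uuid": network}],
    }

spec = server_spec("sdp-pipeline-01", "debian-sdp-base",
                   "m1.large", "net-science-dmz")

# With openstacksdk and a cloud configured (e.g. in clouds.yaml),
# the spec would be submitted as:
#   import openstack
#   conn = openstack.connect(cloud="sdp")
#   server = conn.compute.create_server(**spec)
print(spec["name"])
```

The same spec could be replayed against any OpenStack region or cell, which is what makes the "one API, many sites" model discussed below attractive.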
Apart from these, the range of services and agility of the new players (including large research institutes such as CERN) will allow the SKA to find a community of practitioners with whom it can share developments and co-develop new technologies. The use of OpenStack-based cloud
capability would also allow the SKA project to both insource and outsource compute tasks based on a common platform and the availability of suitable compute capability. Apart from the existing features, there are also some interesting ongoing developments relating to OpenStack, including the scheduling and orchestration of compute tasks and the use of components such as Glance [7] and Glint [8]; the ability to use tiny execution containers, such as ZeroVM [1], associated with storage objects and streams to pre-process data prior to it being shipped; and the use of extremely efficient operating systems such as CoreOS [2]. It is worth mentioning that in the framework of CSP and SDP, AUT, Catalyst and OpenParallel have successfully implemented an OpenStack instantiation on GPUs. Another point worth considering is an ongoing effort to provide Cloud Federation, led by
CERN and Rackspace. Following the success of its Grid infrastructure, CERN wants to achieve
the same using the new cloud-based interfaces.
Containerisation
Containerisation is an operating system level of virtualisation, where a single operating system
instance is shared by multiple isolated user space instances. In Linux isolation is maintained by
means of kernel namespaces, which add an identifier to system calls. In contrast, in hardware
virtualisation the physical hardware is abstracted from the user and a hypervisor presents a
virtual hardware interface to the software. Operating system level virtualisation is characterised
by:
● very low overhead
● direct access to the hardware (if allowed by the host operating system)
● much smaller images compared to hardware virtualized images.
While the cost model for cloud services is fundamentally based around the virtual private server (VPS), containers make it possible to deploy multiple VPS-like instances without incurring the costs of multiple VPSs, yet retaining the characteristics and security of individual VPSs. A container is secure: effectively a confined user space (a chroot jail). Strict guest separation is achieved by means of Linux kernel namespaces, and a container has the characteristics of an independent network stack and resource control/quotas. Systems like Docker provide convenient mechanisms for linking resources from one container to another, creating the necessary linkages for services that are jailed within a container to talk to each other; this includes network communication and filesystem access. These are the core characteristics that enable the likes of Juju [5] and Solum [6] to 'plug' components in containers together to create services. There are technical aspects of providing optimisations for:
● networking - VPSs will now be like mini VLANs as they run multiple hosts
● security - there may be considerations for specialist OSs (like CoreOS) that are stripped back to be essentially boxes that hold containers, with a lot of the standard OS processes removed
● compute - comparative research has shown that the compute overhead of containerisation is minimal compared to virtual machines [RD-02]
● storage - container-style resource sharing adds yet another abstraction layer over real storage, and the picture gets blurred when the host OS shares its storage with a container, or containers share storage between themselves. There are bound to be performance considerations if the trend is towards running processes like database storage within these environments.
There are also management aspects:
● There will be a drive towards having abstracted interfaces for managing containers.
● Performance monitoring tools will need to operate at the container level, container cluster level (within a VPS and across VPSs), and VPS level.
Service design aspects:
● Services need to be designed to take advantage of cloud "as a Service" principles (back-ending by cloud storage, message queues, RDBMSs and so on, with horizontal scaling of web processes and multi-tenant concepts built in).
● It is likely that containerisation will require design pattern changes to take advantage of the paradigm.
With containerisation, each service is a cluster of individual processes (one distinct process per container). Processes should not assume that they speak directly to the underlying OS (e.g., for logging or temporary file storage). Containers can make standalone software viable as a service, instead of having to re-engineer it for multi-tenancy.
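The kernel-namespace isolation described above can be observed directly on a Linux system: every process carries a set of namespace identifiers under /proc/self/ns, and two processes see each other's resources exactly when those identifiers match. A minimal sketch (Linux only):

```python
# Inspect the kernel namespaces of the current process (Linux only).
# Container runtimes isolate processes by giving each container its
# own set of these namespace identifiers.
import os

ns_dir = "/proc/self/ns"
for entry in sorted(os.listdir(ns_dir)):
    # each link reads like "pid:[4026531836]": namespace type plus id
    ident = os.readlink(os.path.join(ns_dir, entry))
    print(entry, "->", ident)
```

Running the same snippet inside and outside a container would show different identifiers for (at least) the pid, net and mnt namespaces, which is the "strict guest separation" referred to above.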
Application development environment and software development kit
Using OpenStack to provide a development and test architecture for SKA engineers would ensure a consistency of approach that would otherwise be hard to realise. It would also allow engineers to define and provision their own tool sets, and at the same time offer these to other engineers should the need arise. Some benefits that OpenStack might provide include:
● The easy ability to build images with different standard software offerings, from which participants can easily spin up virtual machines (VMs), simplifying system builds.
● Well-defined APIs, so VMs can be created quickly on demand, then removed once the required computations have been completed, reducing cost.
● The cloud can be hosted on the SKA's own hardware, providing full control over the hardware, or instead be hosted by a 3rd party, limiting the capital expenditure for the SKA participants.
● Data storage implementation can be tailored to the SKA requirements; these may vary from raw data collection to the output of science experiments.
● OpenStack Cells or Regions can be used to have OpenStack clusters dotted around the world, but with one API and one authentication database used by all of them. This allows, for example, limiting the transfer of large data sets to within national borders.
● Support from major HPC solution providers such as Cray and IBM.
Schedulers
Schedulers used in pre-cursor telescopes
The SKA pre-cursors ASKAP (the Australian Square Kilometre Array Pathfinder), LOFAR (the Low-Frequency Array in the Netherlands), and MeerKAT (a mid-frequency array of parabolic reflectors in South Africa) have implemented various scheduling systems.
ASKAP uses a commercial scheduling software system named PBSPro, which has been found to be functionally stable and to offer good control of the computing hardware. PBSPro was chosen due to the familiarity of iVEC (who maintain the ASKAP computing platform) with it. The major drawback is the high licence cost, which rises rapidly with the number of computing nodes [9].
In LOFAR, another commercial scheduling software system named LST is being used. LST
provides a plethora of resource scheduling functions under the conditions of both online and
offline data processing [10]. The online pipelines perform the observations and data recording
tasks (including correlation and beamforming). The off-line (or post-processing) pipelines take care
of the raw data reduction and calibration/imaging. The two types of pipelines have different
scheduling constraints. In the case of the on-line pipelines, the scheduling software needs to
deal with the LOFAR hardware constraints and resource sharing (e.g., storage and LOFAR
station availability). It also needs to fulfil astronomical constraints such as target visibility and
LST scheduling. The off-line post-processing pipelines only need to be scheduled so that
science priorities are being met and the processing cluster is being used efficiently. The
drawback of this system from the viewpoint of the SKA is the lack of easy portability due to the
very different scheduling constraints for online and offline pipelines.
MeerKAT uses an open-source scheduling package named Celery [11]. There appears to be no commercial support for this tool, although it is still being actively developed by the open-source community.
Adaptive Computing Moab HPC Suite
Moab (http://www.adaptivecomputing.com/) is a commercial cluster management middleware that offers workflow management facets such as schedule optimisation, resource monitoring and load balancing. It is currently used for job distribution on the 145-node low-voltage Clovertown Intel processor cluster known as "The Green Machine" at Swinburne University's Centre for Astrophysics and Supercomputing (CAS).
Simple Linux Utility for Resource Management
The Simple Linux Utility for Resource Management (SLURM, http://slurm.schedmd.com/) [RD-04] is a modular open source resource manager designed for Linux clusters of all sizes. It is currently the de facto standard for cluster resource management in HPC. It allocates access to resources and provides a framework to start, execute and monitor user jobs. The modular nature of this scheduler may allow its behaviour to be modified to suit the SDP requirements: configurable aspects include multiple plugins for scheduling, accounting, prioritisation and topology-aware scheduling, among others.
A conventional HPC scheduler tries to minimise time to solution while optimising the use of the system for as many users as possible. The SDP scheduler has far fewer users, possibly only a single one: the instrument. Additionally, the SDP scheduler needs to provide Local Monitoring and Control with a runtime estimate for a particular scheduled pipeline well in advance, based on a priori information that may be obtained from reference runs.
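In the simplest case, such an a priori runtime estimate could be produced by scaling reference runs. The sketch below fits a linear model in data volume to a set of reference measurements and predicts the runtime of a new pipeline job; all figures are invented for illustration, and `fit_linear` is a hypothetical helper, not part of any scheduler.

```python
# Sketch: predict pipeline runtime from a priori reference runs by
# fitting runtime = a * data_volume + b. All numbers are invented.
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# reference runs: (data volume in TB, measured runtime in minutes)
volumes = [1.0, 2.0, 4.0, 8.0]
runtimes = [12.0, 22.0, 42.0, 82.0]

a, b = fit_linear(volumes, runtimes)
estimate = a * 6.0 + b          # estimate for a 6 TB observation
print(round(estimate, 1))       # 62.0
```

A production estimator would of course need a richer model (heterogeneous hardware, pipeline type, calibration state), but the principle of deriving estimates from reference runs is the same.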
The possibly heterogeneous nature of the SDP Compute Island may complicate the issue even further. While the scheduling task itself is probably less complicated in the SDP than in a normal HPC facility, which makes an existing tool a very attractive starting point, the additional tasks will require significant effort to develop.
The pros of this project include:
● Open source project with very active users and developers community
● Scalable to thousands of nodes and hundreds of thousands of jobs per hour
● Compatibility with different MPI implementations
● Light-weight daemons
● Good and extensive documentation
● There is the option to have commercial support through SchedMD
Cons:
● It lacks more advanced scheduling policies, such as SLA-based or multi-cluster scheduling.
Conclusions
In summary, all of the aforementioned scheduling systems are well established and integrated
into existing management platform software. However, none of these systems comply with the
unique requirements of the SDP, i.e. the ability to estimate the required resources and runtime
beforehand, based on a priori knowledge. As such, the best way forward is probably to rely on
basic policies available in existing scheduling software to design an SDP-specific scheduler that
supports both deadline- and resource-aware scheduling.
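A deadline- and resource-aware policy of the kind suggested here could be sketched as follows. This is a minimal illustration with invented job figures; `schedule` is a hypothetical helper, not part of any existing scheduler, combining earliest-deadline-first ordering with a simple resource-admission check.

```python
# Sketch of a deadline- and resource-aware scheduling policy:
# earliest-deadline-first ordering, admitting a job only if its node
# request fits the currently free resources. All figures are invented.
def schedule(jobs, total_nodes):
    """jobs: list of (name, nodes_needed, deadline). Returns the jobs
    admitted for immediate start and those left in the queue."""
    free = total_nodes
    admitted, queued = [], []
    for name, nodes, deadline in sorted(jobs, key=lambda j: j[2]):
        if nodes <= free:            # resource-aware admission
            free -= nodes
            admitted.append(name)
        else:                        # insufficient nodes: stay queued
            queued.append(name)
    return admitted, queued

jobs = [("imaging", 60, 10), ("calibration", 20, 5), ("ingest", 50, 8)]
admitted, queued = schedule(jobs, total_nodes=100)
print(admitted, queued)   # ['calibration', 'ingest'] ['imaging']
```

An SDP-specific scheduler would combine such a policy with the a priori runtime estimates discussed earlier, so that a job is admitted only if it can also finish before its deadline.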
References
[1] ZeroVM - a "process specific" hypervisor. See also "Using ZeroVM and Swift to Build a Compute Enabled Storage Platform", https://www.openstack.org/summit/openstack-summit-atlanta-2014/session-videos/presentation/using-zerovm-and-swift-to-build-a-compute-enabled-storage-platform
[2] CoreOS - Very efficient Linux distro for deployment of large numbers of servers https://coreos.com/
[3] LXC - Linux containers https://linuxcontainers.org/
[4] Docker - https://www.docker.com/ and see "Docker CEO: Why it just got easier to run multi-container apps" http://www.zdnet.com/docker-ceo-why-it-just-got-easier-to-run-multi-container-apps-7000036356/
[5] Juju - Canonical's Ubuntu based orchestration product https://juju.ubuntu.com/
[6] Solum - as per Juju but designed natively for OpenStack http://solum.io/
[7] Glance - the image service for OpenStack that allows common images to be defined, discovered and deployed automatically.
[8] Glint - image distribution in a multi-cloud environment. Leveraging Glance, this project from Victoria University, Vancouver allows images and associated jobs to be scheduled across multiple cloud platforms. http://www.internetnews.com/blog/skerner/openstack-glance-gets-new-glint-for-replicated-and-distributed-images.html
[9] B. Nitzberg, J.M. Schopf, and J.P. Jones, PBS Pro: Grid computing and scheduling
attributes, in Grid resource management. 2004, Springer. p. 183-190.
[10] M. de Vos, A.W. Gunst, and R. Nijboer, The LOFAR telescope: System architecture and
signal processing. Proceedings of the IEEE, 2009. 97(8): p. 1431-1437.