Federating Clouds for High Energy Physics
OpenStack Summit, May 18-22, 2015
Andre Charbonneau, Martin Conklin, Ronald Desmarais, Colson Driemel, Colin Leavett-Brown, Randall Sobie, Michael Paterson, Ryan Taylor
Ian Gable, University of Victoria
with significant assistance and support from the ATLAS and Belle II Collaborations, and CERN IT
Outline
What is experimental High Energy Physics?
What do our computing workloads look like?
Components of our Distributed Cloud
Cloud Scheduler: Batch Job Management
Glint: VM image distribution
Shoal: Squid cache discovery
Some results
27 km ring
Large Hadron Collider
ATLAS Detector, 2005
ATLAS Detector, 2014
40 million collisions per second
Belle II Detector
KEK Laboratory
Scale and other experiments
Each interesting 'event' is stored on disk.
The ATLAS experiment has roughly 170 PB on disk today, and it is growing all the time.
The LHC experiments and other High Energy Physics experiments are sure to grow to the exascale in the coming years.
Now down to the details.
High Energy Physics Computing workloads
• High Throughput Computing workload composed of mostly embarrassingly parallel tasks (jobs).
• HEP jobs are usually 1-24 hours in length and can run as single-core or multi-core jobs (multi-core saves memory).
• Jobs are either Monte Carlo simulation of collisions or analysis of real collision data from the detector readout.
• Most of the workload today runs on ethernet-connected Linux clusters of 500-10,000 cores at Research and Education institutions around the world.
• On any given day there are roughly 300K cores running HEP jobs for the Worldwide LHC Computing Grid (a collection of non-cloud federated Linux clusters).
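To illustrate "embarrassingly parallel": each simulated collision is independent of every other, so the workload splits trivially across workers. A toy sketch (the event simulation here is invented for illustration; real HEP jobs run full detector simulations, not this):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_event(seed):
    """Toy stand-in for one Monte Carlo 'collision': count detector hits."""
    rng = random.Random(seed)
    return sum(1 for _ in range(1000) if rng.random() < 0.01)

def run_job(n_events, n_workers=4):
    # Every event is independent, so the work splits trivially across
    # workers -- in production the 'workers' are whole cluster nodes.
    with ThreadPoolExecutor(n_workers) as pool:
        return sum(pool.map(simulate_event, range(n_events)))

print(run_job(100))
```

Because events never communicate, throughput scales almost linearly with the number of cores, which is why these jobs suit federated clusters and clouds so well.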
Our IaaS timeline
2005 - Can we use Xen?
2006 - We discover the Nimbus Project
2007 - Amazon EC2
2008 - Multiple Nimbus clouds
2010 - OpenStack arrives
2012 - Multiple OpenStack clouds
2013 - CERN goes OpenStack; major traction in HEP
Today’s Problem and Opportunity
We wish to be able to run across multiple clouds without having any 'special' relationship with those cloud providers. In other words, we can't impose any requirements on them.
CernVM is a RHEL-compatible HEP software appliance in only 20 MB
[Figure: µCernVM image layout - a ~12 MB initrd (Linux kernel, CernVM-FS client, µContextualization) plus a scratch hard disk holding an AUFS R/W overlay; the OS and extras (~100 MB) are loaded on demand from CernVM-FS repositories, steered by user data (EC2, OpenStack, ...), with Fuse and AUFS assembling the root file system.]

Figure 1. A µCernVM based virtual machine is twofold. The µCernVM image contains a Linux kernel, the AUFS union file system, and a CernVM-FS client. The CernVM-FS client connects to a special repository containing the actual operating system. The two CernVM-FS repositories contain the operating system and the experiment software.
…store the CernVM-FS cache. Note that the scratch hard disk does not need to be distributed. It can be created instantaneously when instantiating the virtual machine as an empty, sparse file.

The init ramdisk contains the CernVM-FS client and a steering script. The purpose of the steering script is to create the virtual machine's root file system stack, which is constructed by unifying the CernVM-FS mount point with the writable scratch space. To do so, the steering script can process contextualization information (sometimes called "user data") from various sources, such as OpenStack, OpenNebula, or Amazon EC2. Based on the contextualization information, the CernVM-FS repository and the repository version are selected.

The amount of data that needs to be loaded in order to boot the virtual machine is very little. The image itself sums up to some 12 MB. In order to boot Scientific Linux 6 from CernVM-FS, the CernVM-FS client downloads an additional 100 MB. The CernVM-FS infrastructure used to distribute experiment software can be reused. In comparison, the (already small) CernVM 2.6 virtual appliance sums up to 300 MB to 400 MB that needs to be fully loaded and decompressed upfront before the boot process can start. As a result, a booting µCernVM virtual machine starts practically instantaneously, so that it can be, for instance, integrated with a web site that starts a virtual machine on the click of a button. An example of such a web site is a volunteer computing project by the CERN theory group [11].
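The steering script's selection of a repository and version from user data could be sketched roughly as below. The key names (`cvmfs_repo`, `cvmfs_tag`) and defaults are invented for illustration; they are not the actual µCernVM user-data schema:

```python
# Hypothetical contextualization step: parse EC2/OpenStack-style user data
# to choose which CernVM-FS OS repository and version to boot from.
# Key names and defaults here are illustrative, not the real uCernVM schema.

DEFAULTS = {"cvmfs_repo": "cernvm-prod.cern.ch", "cvmfs_tag": "latest"}

def select_repository(user_data: str) -> dict:
    """Parse simple key=value user-data lines, falling back to defaults."""
    config = dict(DEFAULTS)
    for line in user_data.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config

user_data = """
# passed to the VM via EC2 / OpenStack metadata
cvmfs_repo=cernvm-sl6.cern.ch
cvmfs_tag=1.17-2
"""
print(select_repository(user_data))
# → {'cvmfs_repo': 'cernvm-sl6.cern.ch', 'cvmfs_tag': '1.17-2'}
```

The important property is that the same tiny image can boot different OS versions purely by changing the user data, with no image rebuild.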
3. The µCernVM root file system stack
At the beginning of the Linux boot process, in the so-called early user space, the Linux kernel uses a root file system in memory provided by the init ramdisk. The purpose of the early user space is to load the necessary storage device drivers to access the actual root file system. Once the actual root file system is available, the system switches its root file system to the new root mount point, after which the previous root file system becomes useless and is removed from memory.

Figure 2 shows the transformation of the file system tree in the early user space in µCernVM. First, the scratch hard disk is mounted on /root.rw. µCernVM grabs the first empty hard disk or partition attached to the virtual machine, or remaining free space on the boot hard disk. It automatically partitions, formats, and labels the scratch space. Due to the file system label, µCernVM finds an already prepared scratch space on next boot. The scratch space is used as a persistent writable overlay for local changes to the root file system and as a cache for the
http://cernvm.cern.ch
CVMFS is a caching network file system based on HTTP, optimized for software, i.e. millions of small files
…hierarchy only for lookups; the actual data transfer can take place among any two XRootD cache nodes.
3. CONTENT-ADDRESSABLE STORAGE

With content-addressable storage (CAS), files carry a file name that depends on their content rather than on their location in a directory tree or on a storage device. The content address is retrieved from a cryptographic hash (or at least a collision-free hash) of the content. Content-addressable storage has many advantages, in particular for software repositories:
• Data integrity is trivial to verify by re-hashing files.
• Maintaining cache consistency is trivial as files are immutable and never expire.
• Identical files in different locations are mapped to the same content-addressable file. Hence, CAS provides content de-duplication.

• The hash key used as file name can be re-used for distributed hash tables and key-value stores.

File de-duplication has been observed to be very useful with LHC experiment software. From release to release only a fraction of all files change. By de-duplication the number of files can be reduced by more than a factor of 5. With file compression, the volume can be reduced by a further factor of 2-3.

Besides these advantages, there are also some disadvantages of CAS: converting from location-based addressing to content-addressable storage is compute-intensive. Still, it is worthwhile to use CAS for the "write once read many" (WORM) access pattern of software and experiment data. Cryptographic hashes might get broken, which has happened in the case of MD5 for instance. In such a case, all files must be re-hashed using another algorithm, because otherwise an attacker could inject corrupted files somewhere in the distribution chain.
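The core CAS properties described above (integrity checking by re-hashing, and de-duplication of identical files) can be sketched in a few lines. This is a minimal in-memory illustration, not the CernVM-FS implementation; the choice of SHA-1 here is illustrative:

```python
import hashlib

class ContentAddressableStore:
    """Minimal in-memory CAS: objects are named by the hash of their
    content, so identical files de-duplicate automatically and data
    integrity can be verified by simply re-hashing."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = data   # storing the same content twice is a no-op
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        # Integrity check is trivial: the name *is* the hash of the content.
        assert hashlib.sha1(data).hexdigest() == key
        return data

store = ContentAddressableStore()
k1 = store.put(b"libPhysics.so contents")
k2 = store.put(b"libPhysics.so contents")  # identical file elsewhere in the tree
assert k1 == k2 and len(store._objects) == 1  # de-duplicated to one object
```

Immutability falls out for free: since a changed file gets a new name, cached copies of the old name never go stale.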
A file system interface on top of CAS requires means to translate the directory location into CAS. This is done by file catalogs that map the directory location to the hash key of a file. We store such catalogs in plain files that can be cached as well. Moreover, file catalogs act as a pre-cache of meta-data; a single file catalog request results in a bulk of meta-data.

With millions of directory entries, such a catalog grows to the order of Gigabytes. Hence, we partition large directory trees into many file catalogs. Naturally, such partitioning is done at the top-level directories of the software releases. Using hash trees, i.e. storing the content hash key of a sub file catalog in the parent catalog, the content hash key of a root file catalog is sufficient to re-construct an entire directory hierarchy. We cryptographically sign the root file catalog in order to ensure data authenticity. The content hash key of a root file catalog is also used as a means to publish file system updates. New root hash keys can be propagated either by using an expiry time stamp or by a publish-subscribe system.
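The hash-tree idea can be made concrete with a toy example (catalog contents and the JSON serialization are invented for illustration; real CernVM-FS catalogs are SQLite files):

```python
import hashlib
import json

def catalog_hash(catalog: dict) -> str:
    """Content hash of a (toy) file catalog, serialized canonically."""
    return hashlib.sha1(json.dumps(catalog, sort_keys=True).encode()).hexdigest()

# A leaf catalog maps paths to content hashes of files (toy values).
leaf = {"/sw/release-1/bin/app": "ab12cd"}

# The parent catalog stores the *hash* of each sub-catalog: a hash tree.
root = {"/sw/release-1": catalog_hash(leaf)}
root_hash = catalog_hash(root)

# Changing any leaf entry changes every hash up to the root, so a signed
# root hash alone authenticates the entire directory hierarchy.
tampered_leaf = {"/sw/release-1/bin/app": "ff00ee"}
assert catalog_hash({"/sw/release-1": catalog_hash(tampered_leaf)}) != root_hash
```

This is also why publishing an update reduces to propagating one new root hash key, as the text notes.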
4. CDN FOR SMALL FILES

CernVM-FS uses a content delivery network (CDN) to transfer the CAS data from the software release publisher to the worker nodes. The problem at hand requires a fault-tolerant CDN that scales to the order of 10^5 read-only clients and ensures data integrity and authenticity. The scattered resources used for LHC computing entail restrictions on the network protocol options. For "volunteer" worker nodes behind a NAT layer, as well as for many Grid sites, Internet connectivity is restricted to a few standard protocols, most notably HTTP.

[Figure: replica-server topology - Stratum 0 (read/write), Stratum 1 public mirrors (Switzerland, United Kingdom, U.S. East Coast, Taiwan), Stratum 2 private replicas, proxy hierarchy.]

Figure 1: HTTP content delivery network: Replica servers are arranged in a ring topology (Stratum 0 - Stratum 2). One protected r/w instance feeds a few reliable, public, and globally distributed mirror servers. Proxy servers fetch content from the closest public mirror server. The public mirror server can in turn be a master for private mirror servers, for instance in a large computing center.
In combination with a hierarchy of web caches, HTTP scales smoothly to the size of the LHC Grid and beyond. Our CDN is shown in Figure 1. We currently use 2-3 levels of web caches: a frontend cache at the Stratum 1 servers, local cache servers at the sites, and an optional layer of regional cache servers for distributed sites such as the Nordic Data Grid Facility. The CERN Stratum 1 currently has 40-50 WLCG sites with a total number of 20,000-25,000 worker nodes connected to it. The observed average load on the CERN Stratum 1 webserver is 200 KB/s and 2-3 requests per second; the load on the Stratum 1 frontend caches is 500 KB/s and 10 requests per second.

Besides reducing the load on the Stratum 1 side, local caches reduce the latency on the local site. Latency is an issue especially for experiment analysis software, as it consists of very many small files. 50% of all files are smaller than 4 kB and 80% of all files are smaller than 16 kB. The same statistics over actually requested files reveals 99% of all files being smaller than 5 MB. The HTTP header overhead does not matter very much, as a typical request is anyway answered using 1-2 network packets. Using HTTP keep-alive, we have previously shown that the TCP/HTTP stack outperforms AFS's UDP/Rx stack [7].

Fault-tolerance is obtained by client-side fail-over logic and horizontal scaling. The local web caches are simply duplicated. A single source web server, however, is still a single point of failure, because each cache miss leads to a request on the next-higher cache level. If the source web server is temporarily unavailable, all requests that the Stratum 1 caches cannot handle would result in an I/O error of the CernVM
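The client-side fail-over logic mentioned above amounts to trying each cache or mirror in order and falling back on failure. A minimal sketch, with a hypothetical `fetch` transport function standing in for the real HTTP client:

```python
# Sketch of client-side fail-over: try each web cache / mirror in order and
# fall back to the next on failure, so that a single cache outage does not
# turn into an I/O error for the client. The fetch callable is hypothetical.

def fetch_with_failover(path, servers, fetch):
    """fetch(server, path) returns bytes or raises OSError; try servers in order."""
    last_error = None
    for server in servers:
        try:
            return fetch(server, path)
        except OSError as err:
            last_error = err          # remember and fall through to the next mirror
    raise OSError(f"all mirrors failed for {path}") from last_error

# Usage with a stubbed transport: the local squid is down, the Stratum 1 answers.
def stub_fetch(server, path):
    if server == "http://squid-local:3128":
        raise OSError("connection refused")
    return b"file contents"

data = fetch_with_failover("/data/ab/12cd",
                           ["http://squid-local:3128", "http://stratum1.example/cvmfs"],
                           stub_fetch)
assert data == b"file contents"
```

Note how the sketch also shows the single-point-of-failure problem the text describes: if every server in the list is down, the client has nowhere left to go.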
comes with its own CDN!
Requires a fast and nearby HTTP cache
The caching challenge on IaaS cloud
When booting VMs on arbitrary clouds, they don't know which squid they should use.
In order to work well, VMs need to be able to access a local web cache (squid) so they can efficiently download all the experiment software, and now OS libraries, they need to run.
If a VM is statically configured to access one particular cache, access can be slow (Geneva to Vancouver, for example) and the cache can get overloaded.
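The fix is dynamic discovery: instead of a squid address baked into the image, a booting VM asks which caches are currently advertised and picks the closest-looking one. A toy sketch in the spirit of Shoal (hostnames and latencies are invented; the real Shoal service advertises squids via a REST API):

```python
# Toy dynamic-cache-discovery sketch: pick the advertised squid with the
# lowest measured round-trip time. All names and numbers are hypothetical.

def pick_nearest(caches, rtt_ms):
    """Choose the advertised squid with the lowest measured round-trip time."""
    return min(caches, key=lambda host: rtt_ms[host])

advertised = ["squid.cern.ch", "squid.uvic.ca", "squid.kek.jp"]
measured = {"squid.cern.ch": 145.0, "squid.uvic.ca": 2.1, "squid.kek.jp": 98.0}

print(pick_nearest(advertised, measured))   # → squid.uvic.ca
```

A VM booting in Victoria thus selects the local cache automatically, avoiding both the trans-Atlantic latency and the overload problem of a single static cache.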