Jul 11, 2015
HPC Infrastructures at SURFsara
• Cartesius: 40,000 CPU cores (Top-500 #45)
• Lisa: 9,000 CPU cores
• Grid (LHC, life sciences)
• Hadoop
• Cloud (30 nodes × 32 cores, 256 GB each)
HPC Cloud architecture
• 800 TB shared storage (NFS)
• 2×10 Gbps network
• 4 service nodes (virtualised services)
• 30 large compute nodes
• 10 small compute nodes
Who was using our cloud in 2014?
• Cell genetics: 45%
• Linguistics: 10%
• Medicine: 6%
• Economics: 4%
• Marketing: 5%
• Ecology: 4%
• Geography: 2%
• Civil engineering: 7%
• Physics: 5%
• Business: 3%
• Computer sciences: 7%
• Other: 2%
Our cluster is old, we want a new cluster... in your cloud!
A typical cluster
• Lots of worker nodes
• Central user management (NIS, LDAP)
• Shared home file system
• Local disks for fast I/O
• A job scheduler (Torque, SGE, SLURM)
• Fixed size, bare metal
Typical job
Queue monitor
• ~400 lines of Ruby code
• Runs within an EventMachine mainloop
• Uses qstat to monitor queues
• Uses qconf to add/remove nodes to/from queues
• Uses OCA (the OpenNebula Cloud API) to start/stop nodes
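The monitor's qstat-based check can be sketched in a few lines of plain Ruby. This is an illustrative sketch, not the actual monitor code: the function name, sample output, and job details below are made up, and the real monitor runs this kind of check inside an EventMachine periodic timer.

```ruby
# Illustrative sketch (not the production monitor): count waiting jobs
# by parsing `qstat` output. Grid Engine marks queued jobs with state
# "qw"; the state is the fifth whitespace-separated column.
def waiting_jobs(qstat_output)
  qstat_output.lines.count { |line| line.split[4] == "qw" }
end

# Hypothetical qstat output in the usual SGE column layout:
sample = <<-QSTAT
job-ID  prior   name    user   state submit/start at      queue
---------------------------------------------------------------
    101 0.55500 sim.sh  alice  r     07/11/2015 10:00:00  all.q@node01
    102 0.55500 sim.sh  alice  qw    07/11/2015 10:01:00
    103 0.55500 sim.sh  bob    qw    07/11/2015 10:02:00
QSTAT

puts waiting_jobs(sample)  # → 2
```

In the real monitor the same count feeds the add/remove decisions shown on the next slides.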
Adding nodes...
1. Inspect the queue (qstat)
2. Jobs waiting? If no: wait, then inspect again
3. If yes: start a VM (Ruby OCA)
4. Machine started (reachable over ssh)? If no: wait
5. If yes: tell the scheduler (qconf)
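The steps above can be condensed into a short Ruby sketch. This is hedged pseudocode under stated assumptions, not the production monitor: `FakeQueue`, `FakeCloud`, and `FakeScheduler` are stand-ins for the qstat parsing, the OCA XML-RPC client, and the qconf calls.

```ruby
# Sketch of the add-node flow; the three collaborator objects are
# stubs standing in for qstat, OpenNebula OCA, and qconf.
def add_node_if_needed(queue, cloud, scheduler)
  return :wait unless queue.jobs_waiting?   # qstat: anything queued?
  vm = cloud.start_vm                       # OCA: instantiate a worker VM
  sleep 1 until cloud.reachable?(vm)        # ssh check: wait until booted
  scheduler.enable_node(vm)                 # qconf: add the host to the queue
  :node_added
end

class FakeQueue
  def initialize(waiting); @waiting = waiting; end
  def jobs_waiting?; @waiting > 0; end
end

class FakeCloud
  def start_vm; "vm-1"; end
  def reachable?(_vm); true; end
end

class FakeScheduler
  attr_reader :enabled
  def initialize; @enabled = []; end
  def enable_node(vm); @enabled << vm; end
end

scheduler = FakeScheduler.new
result = add_node_if_needed(FakeQueue.new(3), FakeCloud.new, scheduler)
puts result                     # → node_added
puts scheduler.enabled.inspect  # → ["vm-1"]
```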
Removing nodes...
1. Inspect the queue (qstat)
2. Nodes idle? If no: wait, then inspect again
3. If yes: tell the scheduler (qconf), so no new jobs land on the node
4. Shut down the VM (Ruby OCA)
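The remove-node flow mirrors the add-node one. Again a hedged sketch with stand-in objects (`IdleQueue`, `StubCloud`, `StubScheduler` are illustrative, not real qstat/qconf/OCA interfaces):

```ruby
# Sketch of the remove-node flow: drain each idle worker in the
# scheduler first, then power its VM off through the cloud API.
def remove_idle_nodes(queue, cloud, scheduler)
  idle = queue.idle_nodes                 # qstat: workers with no running jobs
  idle.each do |node|
    scheduler.disable_node(node)          # qconf: drain before shutdown
    cloud.shutdown_vm(node)               # OCA: power the VM off
  end
  idle.size
end

class IdleQueue
  def initialize(nodes); @nodes = nodes; end
  def idle_nodes; @nodes; end
end

class StubCloud
  attr_reader :stopped
  def initialize; @stopped = []; end
  def shutdown_vm(node); @stopped << node; end
end

class StubScheduler
  attr_reader :drained
  def initialize; @drained = []; end
  def disable_node(node); @drained << node; end
end

cloud = StubCloud.new
count = remove_idle_nodes(IdleQueue.new(["vm-1", "vm-2"]), cloud, StubScheduler.new)
puts count                  # → 2
puts cloud.stopped.inspect  # → ["vm-1", "vm-2"]
```

Draining in the scheduler before shutting the VM down matters: the reverse order could kill a node just as the scheduler dispatches a job to it.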
Does it work?
• Yes, in principle...
• But...
Future of our cloud
• OpenNebula 4.x (January 2015)
• More compute nodes (February 2015)
• Ceph storage (February 2015)
• Local SSDs (February 2015)
• GPUs
Conclusions
• Integration with OCA/XML-RPC is possible and flexible
• Know your users and what they want (cattle? pets?)
Questions?