
Jul 03, 2022


HPC Management Good Practices Workshop – CARLA 2018 AJ Rubio-Montero

Evolution of the maintainability of HPC facilities at CIEMAT headquarters

Antonio Juan Rubio Montero [on behalf of the ICT Division]

[Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (CIEMAT), Madrid, Spain]

History of HPC facilities at CIEMAT

Timeline from the 50’s to the 20’s.
Architecture labels on the timeline: Punched Cards, Mainframe, Vectorial, MPP, NUMA, Cluster.
Vendor and platform labels: UNIVAC, IBM, CRAY, SGI, Beowulf.

• 1959 UNIVAC SS
• 1971 UNIVAC 1106
• 1977 UNIVAC 1110
• 1978 UNIVAC 1110/81
• 1985 IBM 3090/150
• 1991 CRAY XMS
• 1991 CRAY YMP-EL
• 1995 CRAY J90
• 1995 CRAY T3E
• 2000 STK9310 Library [1,500 cartridges]
• 2001 SGI Origin 3800
• 2003 SGI Altix 3700
• 2005 Lince (x86-32)
• 2008 Euler (23 TFlops)
• 2010 Dirac (1.27 TFlops)
• 2015 ACME (40.6 + 18.8 TFlops)

Milestone: PDC is built.
Current: Euler, Dirac and ACME.
Future: in 2019, first ¼ of Euler-2 (125.9 TFlops).

Unfortunately, Grace Hopper didn’t work on our UNIVAC SOLID STATE, but we had one!!!

Current HPC infrastructure at CIEMAT headquarters

• Uninterruptible power supply: new batteries and diesel engine, 1,000 kVA
• Efficient cooling, fire protection
• (2008) Euler (23 TFlops) and (2010) Dirac (1.27 TFlops)
  - 251 nodes, 2052 Xeon cores
  - 2 PBS/Torque
  - Infiniband
  - Unchanged base software
• (2015) ACME
  - 24 nodes
  - 720 Xeon cores (40.6 TFlops)
  - 2 Tesla P100 GPU (18.8 TFlops)
  - Slurm, Infiniband
  - Continuously updated
• 16 RAID NAS servers (NFS)
  - 1 intelligent device (NetApp)
  - 13 generic SAN Ethernet
  - 1 RDMA Infiniband (ACME)
  - > 1.5 PB total
• 350 users in 30 research groups, 100 external
• Full monitoring through Nagios: temperature, humidity, power, batteries, hardware and services
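The slide says that temperature, humidity, power, batteries, hardware and services are all monitored through Nagios. As a minimal sketch of what one such check might look like, the Python below follows the usual Nagios plugin convention: exit codes 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, and one status line with optional performance data after a `|`. The function names, the way the reading is obtained and the thresholds (27 °C and 32 °C) are hypothetical and do not describe CIEMAT's actual checks.

```python
"""Sketch of a Nagios-style temperature check (thresholds are hypothetical)."""

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(temp_c, warn=27.0, crit=32.0):
    """Map a temperature reading to a Nagios state and a status line."""
    if temp_c is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if temp_c >= crit:
        state = CRITICAL
    elif temp_c >= warn:
        state = WARNING
    else:
        state = OK
    # Text after '|' is performance data in the form label=value;warn;crit
    line = (f"TEMP {LABELS[state]} - {temp_c:.1f} C"
            f" | temp={temp_c:.1f};{warn:.1f};{crit:.1f}")
    return state, line


def main(argv):
    """Print the status line and return the exit code for Nagios."""
    # Assumption: the reading arrives as the first argument; a real check
    # would query a sensor (for example over SNMP or IPMI) instead.
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        reading = None
    state, line = evaluate(reading)
    print(line)
    return state

# A deployed plugin would end with: sys.exit(main(sys.argv))
```

Nagios only needs the exit code and the first line of output, so the same skeleton could be reused for humidity, battery or service checks by swapping the reading and the thresholds.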

Current HPC infrastructure at CIEMAT headquarters

Backup and storage diagram:

• IBM TS3584 Tape Library (18 drives, 1,581 cartridges, 4.42 PB): daily incremental, 3 months
• Secondary storage servers: daily differential rsync copies, 2 months
• NetApp FAS2554: hourly snapshots, 3 weeks

Link and node labels on the diagram: X 15; Ethernet 1-10 Gbps; Ethernet 4x1 Gbps; Ethernet 1 Gbps; Euler.
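The three backup levels on this slide can be compared with a short calculation. The sketch below is not part of the original talk. It reads the durations on the slide as retention periods, approximates a month as 30 days, and ignores which file systems each level actually protects. All three are assumptions, and the class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTier:
    name: str
    interval_hours: int   # time between consecutive copies
    retention_days: int   # how long a copy is kept

    def restore_points(self) -> int:
        """Approximate number of copies held at any one time."""
        return self.retention_days * 24 // self.interval_hours


# Figures from the slide; months are approximated as 30 days (assumption).
TIERS = [
    BackupTier("NetApp hourly snapshots", 1, 21),
    BackupTier("Secondary servers, daily differential rsync", 24, 60),
    BackupTier("Tape library, daily incremental", 24, 90),
]


def tiers_covering(age_days):
    """Names of the tiers that should still hold a copy of that age."""
    return [t.name for t in TIERS if age_days <= t.retention_days]


for tier in TIERS:
    print(f"{tier.name}: about {tier.restore_points()} restore points")
print(tiers_covering(30))
```

Under these assumptions the snapshots give the finest granularity over the shortest window, while a version from 30 days back would only be recoverable from the rsync copies or from tape.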

Future acquisitions (2019)

Euler replacement. Practices:
- constellation design
- Slurm based:
  - checkpointing
  - predefined containers
- yearly update cycle:
  - software
  - 25% of hardware
- Daily snapshots MD34xx
- NDMP backup 10 Gbps

2019: first ¼ of Euler-2
- 41 nodes
- 1640 Xeon 6148 cores
- Rpeak > 125.9 TFlops
- 600 TB based on Lustre
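The Rpeak figure quoted for the first quarter of Euler-2 can be cross-checked with the usual formula: cores × clock × floating-point operations per cycle per core. The sketch below is not from the slides. It assumes the Xeon 6148 runs at a 2.4 GHz base clock and sustains 32 double-precision FLOP per cycle per core (two AVX-512 FMA units). Both values are assumptions taken from the processor's public specifications.

```python
def rpeak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOP per cycle per core, in TFlops."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# From the slide: 41 nodes and 1640 cores.
nodes, cores = 41, 1640
cores_per_node = cores // nodes

# Assumptions (not on the slide): 2.4 GHz base clock and 32 double-precision
# FLOP per cycle per core (two AVX-512 FMA units).
peak = rpeak_tflops(cores, clock_ghz=2.4, flops_per_cycle=32)
print(f"{cores_per_node} cores per node, Rpeak of about {peak:.2f} TFlops")
```

With these inputs the calculation gives 40 cores per node and roughly 125.95 TFlops, consistent with the "Rpeak > 125.9 TFlops" on the slide.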

antonio.rubio <at> ciemat.es
CIEMAT
Avda. Complutense, 40 – 28040 Madrid
http://www.ciemat.es
http://rdgroups.ciemat.es/web/sci-track/

THANK YOU!!!