Page 1: Prague Site Report

Prague Site Report

Jiří Chudoba, Institute of Physics, Prague

23.4.2012, HEPiX meeting, Prague

Page 2: Prague Site Report


Local Organization

• Institute of Physics:
  o 2 locations in Prague, 1 in Olomouc
  o 786 employees (281 researchers + 78 doctoral students)
• Department of Networking and Computing Techniques (SAVT)
  o networking up to offices, mail and web servers, central services
• Computing Centre (CC)
  o large-scale calculations
  o part of SAVT (except its leader, Jiří Chudoba)
• Division of Elementary Particle Physics
  o Department of Detector Development and Data Processing
    • head: Miloš Lokajíček
    • started large-scale calculations, later transferred to the CC
    • the biggest hardware contributor (LHC computing)
    • participates in the CC operation

Page 3: Prague Site Report


Server room I

• Server room I (Na Slovance)
  o 62 m², ~20 racks
  o 350 kVA motor generator, 200 + 2x 100 kVA UPS
  o 108 kW air cooling, 176 kW water cooling
  o continuous changes
  o hosts computing servers and central services

Page 4: Prague Site Report


Other server rooms

• New server room for SAVT
  o located next to server room I
  o independent UPS (24 kW now, max 64 kW n+1), motor generator (96 kW), cooling 25 kW (n+1)
  o dedicated to central services
  o 16 m², now 4 racks (room for 6)
  o very high reliability required
  o first servers moved in last week
• Server room Cukrovarnická
  o another building in Prague
  o 14 m², 3 racks (max 5), 20 kW central UPS, 2x 8 kW cooling
  o backup servers and services
• Server room UTIA
  o 3 racks, 7 kW cooling, 3 + 5x 1.5 kW UPS
  o dedicated to the Department of Condensed Matter Theory

Page 6: Prague Site Report


Clusters in CC - Dorje

• Dorje: Altix ICE8200, 1.5 racks
  o 512 cores on 64 diskless WNs, InfiniBand, 2 disk arrays (6 + 14 TB)
  o only local users: solid state physics, condensed matter theory
  o 1 admin for administration and user support
  o relatively small number of jobs; MPI jobs up to 256 processes (see the sketch below)
  o Torque + Maui, SLES10 SP2, SGI Tempo, MKL, OpenMPI, ifort
  o users run mostly: Wien2k, VASP, fireball, apls
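The batch setup above (Torque + Maui with OpenMPI on 8-core nodes) implies submissions along these lines. A minimal sketch only: the job name, queue defaults, walltime and the application binary ./my_mpi_app are illustrative assumptions, not values from the slides.

```python
#!/usr/bin/env python3
# Sketch: submit a 256-process MPI job on a Torque/Maui cluster like
# Dorje (64 worker nodes x 8 cores = 512 cores). Job name, walltime
# and binary are placeholders.
import subprocess

pbs_script = """#!/bin/bash
#PBS -N mpi256
#PBS -l nodes=32:ppn=8
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
# 32 nodes x 8 cores/node = 256 MPI ranks; OpenMPI built with Torque
# (TM) support picks up the allocated node list by itself.
mpirun -np 256 ./my_mpi_app
"""

# qsub reads the job script from stdin when no file argument is given
out = subprocess.run(["qsub"], input=pbs_script, capture_output=True,
                     text=True, check=True)
print("submitted:", out.stdout.strip())
```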

Page 7: Prague Site Report


Cluster LUNA

• 2 SunFire X4600 servers
  o 8 CPUs, 32 cores, 256 GB RAM each
• 4 SunFire V20z, V40z servers
• Operated by CESNET MetaCentrum, the distributed computing activity of NGI_CZ
• MetaCentrum
  o 9 locations
  o 3500 cores
  o 300 TB

Page 8: Prague Site Report


Cluster Thsun, small group servers

• Thsun
  o a “private” cluster
  o small number of users
  o power users with root privileges
  o 12 servers of varied hardware
• Servers for groups
  o managed by the groups in collaboration with the CC

Page 9: Prague Site Report


Cluster Golias

• Upgraded every year; consists of several subclusters, each of identical hardware
• 3812 cores, 30700 HS06
• almost 2 PB of disk space
• the newest subcluster, rubus (March 2012):
  o 23 SGI Rackable C1001-G13 nodes
  o 2x Opteron 6274 (16 cores each), 64 GB RAM, 2x 300 GB SAS per node
  o 374 W per node at full load
  o 232 HS06 per node, 5343 HS06 in total (see the check below)
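The rubus figures can be cross-checked with simple arithmetic; a quick sketch using only the numbers quoted above:

```python
# Cross-check of the rubus subcluster figures quoted above.
nodes = 23
cores_per_node = 2 * 16            # 2x Opteron 6274, 16 cores each
hs06_per_node = 232

print(nodes * cores_per_node)      # 736 cores in the subcluster
print(nodes * hs06_per_node)       # 5336; the slide quotes 5343,
# i.e. the per-node score of 232 is a rounded ~232.3 HS06.
```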

Page 10: Prague Site Report


Golias shares

Experiment    2011 HS06   share    2012 HS06   share
Alice+Star         7551     30%         7564     25%
Atlas              7087     28%        11861     39%
D0                 9165     37%         9969     32%
Solid               914      4%          629      2%
Calice               30      0%           13      0%
Auger               205      1%          668      2%
Total             24951    100%        30704    100%

[Pie charts: experiment shares of Golias (d0, alice, atlas, auger, solid)]

[Pie chart: subclusters' contribution to the total performance (Golias-p, Golias-c, Iberis, Ibis, Ib, Salix, Saltix, Dorje, Rubus)]

[Chart: planned vs. real usage (walltime)]
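The share columns are just each experiment's HS06 divided by the yearly total; a quick sketch reproducing the 2012 column of the table above:

```python
# Recompute the 2012 share column of the Golias table above.
hs06_2012 = {"Alice+Star": 7564, "Atlas": 11861, "D0": 9969,
             "Solid": 629, "Calice": 13, "Auger": 668}
total = sum(hs06_2012.values())    # 30704, matching the slide
for exp, val in hs06_2012.items():
    print("%-10s %5d  %3.0f%%" % (exp, val, 100.0 * val / total))
```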

Page 11: Prague Site Report


WLCG Tier2

• cluster Golias @ FZU + xrootd servers @ Rez
• 2012 pledges:
  o ATLAS: 10000 HS06, 1030 TiB pledged; 11861 HS06, 1300 TB available
  o ALICE: 5000 HS06, 420 TiB pledged; 7564 HS06, 540 TB available
• delivery of almost 600 TB delayed due to floods
• 66% efficiency is assumed for WLCG accounting
  o accounted capacity therefore sometimes falls under 100% of the pledges (see the check below)
• low cputime/walltime ratio for ALICE
  o not only at our site
  o tests with limits on the number of concurrent jobs (last week):
    • no limit (about 900 jobs): 45%
    • limit of 600 jobs: 54%
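One plausible reading of the accounting bullet, as a sketch: applying the assumed 66% efficiency to the installed capacities from this slide puts both experiments just below their 2012 pledges.

```python
# Accounted capacity = installed HS06 x assumed 66% CPU efficiency.
EFFICIENCY = 0.66
pledged_installed = {"ATLAS": (10000, 11861), "ALICE": (5000, 7564)}
for vo, (pledged, installed) in pledged_installed.items():
    accounted = installed * EFFICIENCY
    print("%s: %.0f HS06 accounted = %.1f%% of pledge"
          % (vo, accounted, 100.0 * accounted / pledged))
# ATLAS: 7828 HS06 accounted = 78.3% of pledge
# ALICE: 4992 HS06 accounted = 99.8% of pledge
```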

Page 12: Prague Site Report


Utilization

• Very high average utilization
  o several different projects, different tools for production
  o D0: production submitted locally by 1 user
  o ATLAS: PanDA, Ganga, local users; DPM
  o ALICE: VO box; xrootd

[Utilization plots for D0, ALICE, ATLAS]

Page 13: Prague Site Report


Networking

• CESNET upgraded our main Cisco router
  o 6506 -> 6509
  o supervisor SUP720 -> SUP2T
  o new 8x 10G X2 card
  o planned upgrade of power supplies: 2x 3 kW -> 2x 6 kW
• (2 cards with 48x 1 Gbps, 1 card with 4x 10 Gbps, FW service module)

Page 17: Prague Site Report


External connection

• Exclusive: 1 Gbps (to FZK) + 10 Gbps (CESNET)
• Shared: 10 Gbps (PASNET to GEANT)
• Not enough for the ATLAS T2D limit (5 MB/s to/from T1s)
• perfSONAR installed

[perfSONAR throughput plots: FZK -> FZU, FZU -> FZK, PASNET link]

Page 18: Prague Site Report


Miscellaneous items

• Torque server performance
  o sometimes long response times under heavy job load
  o divide Golias into 2 clusters with 2 Torque instances?
  o memory limits for the ATLAS and ALICE queues
• CVMFS
  o used by ATLAS, works well
  o some older nodes have too small disks -> excluded for ATLAS (see the sizing sketch below)
• Management
  o Cfengine v2 used for production
  o Puppet used for the IPv6 testbed
• 2 new 64-core nodes
  o SGI Rackable H2106-G7, 128 GB RAM, 4x Opteron 6274 2.2 GHz, 446 HS06
  o frequent crashes when loaded with jobs
• Another 2 servers with Intel Sandy Bridge expected
  o small subclusters with different hardware
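A possible way to screen for the "too small disk" nodes mentioned under CVMFS is to compare free space on the cache partition against the configured quota. A minimal sketch: the cache path is the CVMFS default, and the 20 GB quota is an assumed value, not the site's actual setting.

```python
# Sketch: flag worker nodes whose local disk cannot hold a CVMFS cache.
# The quota below is an illustrative assumption, not the site's
# actual CVMFS_QUOTA_LIMIT.
import shutil

CACHE_PATH = "/var/lib/cvmfs"      # default CVMFS cache location
QUOTA_GB = 20.0                    # assumed cache quota

free_gb = shutil.disk_usage(CACHE_PATH).free / 1e9
if free_gb < QUOTA_GB:
    print("disk too small (%.1f GB free) -> exclude node for ATLAS" % free_gb)
else:
    print("cache fits: %.1f GB free" % free_gb)
```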

Page 19: Prague Site Report


Water cooling

• Active vs. passive cooling doors
  o 1 new rack with cooling doors
  o 2 new cooling doors on APC racks

Page 20: Prague Site Report


Water cooling

• good sealing is crucial

[Temperature plots: disk servers with cooling on/off (divider added), worker nodes, rubus01]

Page 21: Prague Site Report


Distributed Tier2, Tier3s

• Networking infrastructure (provided by CESNET) connects all Prague institutions involved:
  o Academy of Sciences of the Czech Republic
    • Institute of Physics (FZU, Tier-2)
    • Nuclear Physics Institute
  o Charles University in Prague
    • Faculty of Mathematics and Physics
  o Czech Technical University in Prague
    • Faculty of Nuclear Sciences and Physical Engineering
    • Institute of Experimental and Applied Physics
• Now only NPI hosts resources visible in the Grid
  o many reasons why the others do not: manpower, suitable rooms, lack of IPv4 addresses
• Data Storage group at CESNET
  o deployment for LHC projects under discussion

Page 22: Prague Site Report


• Thanks to my colleagues for their help with the preparation of these slides:
  o Marek Eliáš
  o Lukáš Fiala
  o Jiří Horký
  o Tomáš Hrubý
  o Tomáš Kouba
  o Jan Kundrát
  o Miloš Lokajíček
  o Petr Roupec
  o Jana Uhlířová
  o Ota Velínský
