Page 1:

BINP/GCF Status Report

Jan 2010, A.S.Zaytsev@inp.nsk.su

LCG

Page 2:


Overview

- Current status
- Resource accounting
- Summary of recent activities and achievements
- BINP/GCF & NUSC (NSU) integration
- BINP LCG site related activities
- Proposed hardware upgrades
- Future prospects

Page 3:


BINP LCG Farm: Present Status


- CPU: 40 cores (100 kSI2k) | 200 GB RAM
- HDD: 25 TB raw (22 TB visible)
- Input power limit: 15 kVA
- Heat output: 5 kW
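To make these figures easier to relate to the upgrade discussion later in the report, here is a back-of-envelope check (a minimal sketch in Python; the derived per-core values and the rough 1 kVA ~ 1 kW equivalence are our own arithmetic, not numbers from the slide):

```python
# Back-of-envelope arithmetic for the current farm figures quoted above.
CORES = 40
KSI2K_TOTAL = 100.0      # total farm computing power, kSI2k
RAM_GB = 200
HEAT_KW = 5.0            # heat output
POWER_LIMIT_KVA = 15.0   # input power limit

print(f"per-core rating: {KSI2K_TOTAL / CORES:.1f} kSI2k/core")  # 2.5
print(f"RAM per core   : {RAM_GB / CORES:.1f} GB/core")          # 5.0
# Treating 1 kVA as roughly 1 kW, the farm dissipates about a third
# of the allowed input power, leaving headroom for new hardware.
print(f"power headroom : {POWER_LIMIT_KVA - HEAT_KW:.0f} kVA")   # 10
```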

Page 4:


Resource Allocation Accounting (up to 80 VM slots are now available within 200 GB of RAM)

Computing Power (90% full, 150% reserved against a 200% limit; see the accounting sketch below):

- LCG: 4 host systems now (40%); a 70% share is prospected for production with ATLAS VO in the near future
- KEDR: 4.0–4.5 host systems (40–45%)
- VEPP-2000, CMD-3, SND, test VMs, etc.: 1.5–2.0 host systems (15–20%)

Centralized Storage (35% full, 90% reserved against a 100% limit):

- LCG: 0.5 TB (VM images) + 15 TB (DPM + VO SW)
- KEDR: 0.5 TB (VM images) + 4 TB (local backups)
- CMD-3: 1 TB reserved for the scratch area & local home
- NUSC / NSU: up to 4 TB reserved for the local NFS/PVFS2 buffer
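The "full/reserved/limit" triples above follow the same accounting pattern for both resources; the sketch below reproduces them (the absolute used/reserved numbers are back-computed from the quoted percentages, and the 200% CPU limit corresponds to the 2:1 overcommit of 80 VM slots on 40 physical cores):

```python
def account(label, used, reserved, capacity, limit_pct):
    """Report utilization and reservation against a capacity and an
    overcommit limit expressed as a percentage of that capacity."""
    print(f"{label}: {used / capacity:.0%} full, "
          f"{reserved / capacity:.0%} reserved "
          f"({limit_pct}% limit)")

# CPU: 40 physical cores, up to 80 VM slots => 200% overcommit limit.
# used=36 and reserved=60 are back-computed from the quoted 90%/150%.
account("CPU", used=36, reserved=60, capacity=40, limit_pct=200)

# Storage: 22 TB visible, no overcommit allowed (100% limit).
# used=7.7 and reserved=19.8 are back-computed from the quoted 35%/90%.
account("Storage", used=7.7, reserved=19.8, capacity=22, limit_pct=100)
```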

Page 5:


BINP/GCF Activities in 2009Q4 (sorted by priority, from highest to lowest)

- [done] Testing and tuning the 10 Gbps NSC/SCN channel to NSU and getting it to a production state
- [done] Deploying a minimalistic LCG site locally at BINP
- [done] Integrating the BINP/GCF and NUSC (NSU) cluster network and virtualization systems
- [done] Probing the feasibility of efficient use of resources under VMware with native KEDR VMs deployed in various ways
- [done] Finding a long-term stable configuration of KEDR VMs running on several host systems in parallel
- [in progress] Getting to production with ATLAS VO with a 25 kSI2k / 15 TB SLC4-based LCG site configuration
- [in progress] Preparing LCG VMs for running on the NUSC (NSU) side
- [in progress] Studying the impact of BINP-MSK & BINP-CERN connectivity issues on GStat & SAM test failures


Page 6:


BINP/GCF & NUSC (NSU) Integration

- BINP/GCF: XEN images; NUSC: VMware images (converted from XEN; see the conversion sketch after this list)
- Various deployment options were studied:
  - IDE/SCSI virtual disk (VD)
  - VD performance/reliability tuning
  - Local/centralized deployment
  - 1:1 and 2:1 VCPU/real CPU core modes
  - Disabling swap on the host system
- Up to 2 host systems with 16 VCPUs combined have been tested (1 GB RAM/VCPU)
- Long-term stability (up to 5 days) has so far been shown only for locally deployed VMs; the remaining problems are most likely related to the centralized storage system of the NUSC cluster
- Work is now suspended due to a hardware failure of the NSC/SCN switch on the BINP side (more news by the end of the week)
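The XEN-to-VMware image conversion mentioned above can be done with standard tooling; below is a minimal sketch using qemu-img (the image paths are hypothetical, and qemu-img is one common converter rather than necessarily the tool used for the BINP/NUSC images):

```python
import subprocess

# Convert a raw Xen guest disk into a VMDK that VMware can boot.
# Paths are hypothetical examples, not the actual BINP/GCF image names.
SRC = "/srv/xen/images/kedr-wn01.img"   # raw Xen disk image
DST = "/srv/vmware/kedr-wn01.vmdk"      # VMware virtual disk to produce

subprocess.run(
    ["qemu-img", "convert",
     "-f", "raw",    # input format: raw disk image
     "-O", "vmdk",   # output format understood by VMware
     SRC, DST],
    check=True,      # raise an error if the conversion fails
)
print(f"converted {SRC} -> {DST}")
```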


Page 7:


BINP LCG Site Related Activities

STEP 1: DONE
- Defining the basic site configuration, deploying the LCG VMs, going through the GOCDB registration, etc.

STEP 2: DONE
- Refining the VM configuration, tuning up the network, getting new RDIG host certs, registering VOs, handling the errors reported by SAM tests, etc.

STEP 3: IN PROGRESS
- Get OK for all the SAM tests (currently being dealt with)
- Confirm the stability of operations for 1-2 weeks
- Upscale the number of WNs to the production level (from 12 up to 32 CPU cores = 80 kSI2k max)
- Ask ATLAS VO admins to install the experimental software on the site
- Test the site's ability to run ATLAS production jobs
- Check whether the 110 Mbps SB RAS channel can carry the load of an 80 kSI2k site (a rough estimate is sketched below)
- Get to production with ATLAS VO
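The channel check in STEP 3 reduces to simple arithmetic once a per-kSI2k bandwidth demand is assumed; a minimal sketch follows (the 1 Mbps per kSI2k figure is an illustrative assumption, which is exactly why the test itself is needed):

```python
# Rough estimate: can a 110 Mbps channel feed an 80 kSI2k site?
CHANNEL_MBPS = 110            # SB RAS channel capacity
SITE_KSI2K = 80               # target site computing power
DEMAND_PER_KSI2K_MBPS = 1.0   # assumed sustained WAN demand (illustrative)

demand = SITE_KSI2K * DEMAND_PER_KSI2K_MBPS
print(f"estimated demand: {demand:.0f} Mbps "
      f"({demand / CHANNEL_MBPS:.0%} of the {CHANNEL_MBPS} Mbps channel)")
# Under this assumption the site would use ~73% of the channel,
# leaving little margin; hence the explicit capacity check above.
```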


Page 8:


BINP/GCF Activities in 2010Q1-2 (sorted by priority, from highest to lowest)

- Recovering from the 10 Gbps NSC/SCN failure on the BINP side
- Getting to production with 32-64 VCPUs for KEDR VMs on the NUSC side
- Recovering BINP LCG site visibility under GStat 2.0
- Getting to production with ATLAS VO with the 25 kSI2k / 15 TB LCG site configuration
- Testing LCG VMs on the NUSC (NSU) side
- Finding a stable configuration for LCG VMs on NUSC
- Upscaling the LCG site to 80-200 kSI2k by using both BINP/GCF and NUSC resources
- Migrating the LCG site to SLC5.x86_64 and CREAM CE, as suggested by ATLAS VO and RDIG
- Making a quantitative conclusion on how the existing NSC networking channel limits our LCG site performance/reliability
- Allowing other local experiments to access NUSC resources via GRID farm interfaces (using the farm as a pre-production environment)


Page 9:


Future Prospects


- Major upgrade of the BINP/GCF hardware, focusing on storage system capacity and performance:
  - Up to 0.5 PB of online storage
  - Switched SAN fabric
- Further extension of the SC Network and virtualization environment; TSU, with 1100+ CPU cores, is the most attractive target
- Solving the NSK-MSK connectivity problem for the LCG site; a dedicated VPN to MSK-IX seems to be the best solution
- Start getting the next-generation hardware this year:
  - 8x increase in CPU core density
  - Adding a DDR IB (20 Gbps) network to the farm
  - 8 Gbps FC based SAN
  - 2x increase in storage density
- Establishing private 10 Gbps links between the local experiments and the BINP/GCF farm, thus allowing them to use NUSC resources

Page 10:


680 CPU cores/540 TB Configuration


- 16 CPU cores / 1U, 4 GB RAM / CPU core
- 8 Gbps FC SAN fabric
- 20 Gbps (DDR IB) / 10 Gbps (Ethernet) / 4x 1 Gbps (Ethernet) interconnect
- 95 kVA UPS subsystem

2012 (prospected), 1.4 M$ in total

Page 11:


168 CPU cores/300 TB Configuration


- 55 kVA UPS subsystem
- 5x CPU power, 10x storage capacity; DDR IB & 8 Gbps FC are already added at this stage (cross-checked in the sketch below)

2010 (proposed), +14 MRub
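The headline multipliers can be cross-checked against the current farm (40 cores / 100 kSI2k / 25 TB raw); a minimal sketch, assuming that "CPU power" is meant in kSI2k terms (the slide does not state this explicitly):

```python
# Cross-check of the "5x CPU power, 10x storage capacity" claims.
cur_cores, cur_ksi2k, cur_tb = 40, 100.0, 25.0   # current farm
new_cores, new_tb = 168, 300.0                   # proposed 2010 configuration

print(f"core count ratio: {new_cores / cur_cores:.1f}x")  # 4.2x
print(f"storage ratio   : {new_tb / cur_tb:.1f}x")        # 12x, roughly the quoted 10x
# For the quoted 5x CPU power, the new cores would need to average:
print(f"implied rating  : {5 * cur_ksi2k / new_cores:.1f} kSI2k/core")  # ~3.0
# versus 2.5 kSI2k/core today, which is plausible for newer CPUs.
```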

Page 12:


PDU & Cooling Requirements

PDU:
- 15 kVA is available now (close to the limit; no way to plug in the proposed 20 kVA UPS devices!)
- 170-200 kVA (0.4 kV) & APC EPO subsystems are needed (a draft of the tech specs was prepared in 2009Q2)
- Engineering drawings for the BINP/ITF hall have been recovered by CSD
- The list of requirements is yet to be finalized

Cooling:
- 30-35 kW is available now (7 kW modules, open tech water circuit)
- 120-150 kW of extra cooling is required (assuming an N+1 redundancy scheme; see the sizing sketch below)
- Various cooling schemes were studied; locally installed water-cooled air conditioners seem to be the best solution (18 kW modules, closed water loop)
- No final design yet


Once the hardware purchasing plans for 2010 are settled, the upgrade must be initiated.
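The N+1 sizing for the extra cooling reduces to a small calculation; a minimal sketch using the figures from this slide (120-150 kW of extra load, 18 kW modules):

```python
import math

# N+1 sizing of the extra cooling quoted above:
# 120-150 kW of additional load, covered by 18 kW water-cooled modules.
MODULE_KW = 18.0

for load_kw in (120.0, 150.0):
    n = math.ceil(load_kw / MODULE_KW)  # modules needed to carry the load
    print(f"{load_kw:.0f} kW load: {n} + 1 spare = "
          f"{n + 1} x {MODULE_KW:.0f} kW modules installed")
# 120 kW -> 7 + 1 = 8 modules; 150 kW -> 9 + 1 = 10 modules.
```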

Page 13:


Prospected 10 Gbps SC Network Layout

[Diagram: prospected 10 Gbps SC Network layout, linking a site with 1000+ CPU cores (2010Q3-4) and one with 1100+ CPU cores (since 2007)]

Page 14:


Summary

- Major success has been achieved in integrating the BINP/GCF and NUSC (NSU) computing resources
- The scheme tested with KEDR VMs should be exploited by other experiments as well (e.g. CMD-3)
- The 10 Gbps channel (once restored) will allow direct use of NUSC resources from the BINP site (e.g. ANSYS for the needs of VEPP-2000)
- The LCG site may take advantage of the NUSC resources as well (200 kSI2k would give us a much better appearance)
- An upgrade of the BINP/ITF infrastructure is required for installing the new hardware (at least for the PDU subsystem)
- If we are able to get the extra networking hardware as proposed, we may start plugging the experiments into the GRID farm and NUSC resources with 10 Gbps Ethernet uplinks this year


Page 15:

Questions & Comments