CERN IT Department (PES), CH-1211 Genève 23, Switzerland, www.cern.ch/it
CERN Batch System, Monitoring and Accounting
HEPiX Fall 2012
Jérôme Belleman, CERN – IT-PES, October 2012


Context

Growing community

Busier batch system

Agile Infrastructure project

Outline

1 Batch System Challenges

2 Batch Monitoring Tools

3 Batch Accounting Overhaul

Section 1

Batch System Challenges

CERN Batch Setup

Platform LSF 7.0.6

All resources to one cluster

Different shares for different customers: public, grid and several for CERN experiments

[Diagram: LSF Master Node, NFS Server, LSF Master Failover, worker nodes (WN), Local Jobs, Grid Jobs]

A Large Batch System

> 4 000 physical nodes

> 60 000 cores, some SMT-enabled (25% overcommit)

> 55 000 job slots, > 400 000 jobs/day
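To put these figures in proportion, here is a small back-of-the-envelope sketch. It treats the lower bounds from the slide as exact values and assumes, purely for illustration, that jobs arrive evenly over the day and that every slot is always busy. Neither assumption comes from the talk.

```python
# Figures from the slide; they are lower bounds, used here as exact values.
JOB_SLOTS = 55_000
JOBS_PER_DAY = 400_000

SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_slot_per_day(jobs_per_day: int, slots: int) -> float:
    """Average number of jobs each slot runs in one day."""
    return jobs_per_day / slots


def mean_hours_per_job_if_full(jobs_per_day: int, slots: int) -> float:
    """Mean slot time per job, assuming every slot is busy all day."""
    return 24 * slots / jobs_per_day


def submissions_per_second(jobs_per_day: int) -> float:
    """Average submission rate, assuming an even spread over the day."""
    return jobs_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"{jobs_per_slot_per_day(JOBS_PER_DAY, JOB_SLOTS):.2f} jobs/slot/day")
    print(f"{mean_hours_per_job_if_full(JOBS_PER_DAY, JOB_SLOTS):.2f} h/job")
    print(f"{submissions_per_second(JOBS_PER_DAY):.2f} submissions/s")
```

Real load is bursty, so peak submission rates would be higher than this average.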

Future of the Batch Service

Agile Infrastructure Project:

Virtualise resources in CC: batch nodes to be fat VMs

Uniform IaaS layer

Configuration management with Puppet

Today’s Operational Issues

High submission and query load → Slow response

Ensuring fairshare scheduling

Complex LSF setup

Poor dynamism requiring daily reconfiguration

Scalability

Possible Alternatives to LSF

Goal for 5 years:

4 000 → 12 000 physical nodes

60 000 → 300 000 cores

Support frequent structural changes
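The five-year goals above imply different growth factors for nodes and for cores. The sketch below works them out from the slide's numbers. The cores-per-node averages are derived values and do not appear in the talk.

```python
# Current and target figures as given on the slide.
NODES_NOW, NODES_TARGET = 4_000, 12_000
CORES_NOW, CORES_TARGET = 60_000, 300_000


def growth_factor(current: int, target: int) -> float:
    """How many times larger the target is than the current value."""
    return target / current


def cores_per_node(cores: int, nodes: int) -> float:
    """Average core count per physical node (a derived figure)."""
    return cores / nodes


if __name__ == "__main__":
    print("node growth:", growth_factor(NODES_NOW, NODES_TARGET))
    print("core growth:", growth_factor(CORES_NOW, CORES_TARGET))
    print("cores/node now:", cores_per_node(CORES_NOW, NODES_NOW))
    print("cores/node target:", cores_per_node(CORES_TARGET, NODES_TARGET))
```

Cores grow faster than nodes (5× against 3×), so the average node would be denser, and a scheduler has to cope with more slots per host as well as more hosts.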

Possible alternatives (unordered):

LSF 8

Condor

Grid Engine

Torque

SLURM (being evaluated)

Evaluating SLURM

From the SLURM Web site:

Free

65 000 physical nodes

120 000 jobs/hour

Active community

Extensible via plug-ins
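The advertised SLURM limits can be compared with the figures quoted elsewhere in this talk (more than 400 000 jobs/day today, 12 000 nodes as the five-year goal). The following sketch does so. It compares against an average hourly rate, which is an assumption, since peak load is not given in the slides.

```python
# Limits quoted from the SLURM Web site on this slide.
SLURM_MAX_NODES = 65_000
SLURM_JOBS_PER_HOUR = 120_000

# CERN figures from earlier slides (the job count is a lower bound).
CERN_JOBS_PER_DAY = 400_000
CERN_TARGET_NODES = 12_000


def headroom(capacity: float, demand: float) -> float:
    """Ratio of advertised capacity to required capacity."""
    return capacity / demand


def average_jobs_per_hour(jobs_per_day: int) -> float:
    """Average hourly rate, assuming an even spread over 24 hours."""
    return jobs_per_day / 24


if __name__ == "__main__":
    hourly = average_jobs_per_hour(CERN_JOBS_PER_DAY)
    print(f"throughput headroom: {headroom(SLURM_JOBS_PER_HOUR, hourly):.1f}x")
    print(f"node headroom: {headroom(SLURM_MAX_NODES, CERN_TARGET_NODES):.1f}x")
```

These ratios only relate advertised numbers to averages. Whether they hold under a realistic query and submission load is a separate question.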

Test bed:

Implement and test hierarchical fairshare model

Controllably submit queries and jobs

Reproducible load

Scale number of hosts, jobs, slots and queries
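To illustrate what a hierarchical fairshare model means, here is a minimal generic sketch. It is not the LSF or SLURM algorithm: it only computes each leaf group's target fraction of the cluster as the product of its normalised shares along the path from the root. The tree, the share values and the names `exp_a` and `exp_b` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# A node is (shares, children); children is None for a leaf group.
Node = Tuple[int, Optional[Dict[str, "Node"]]]


def effective_shares(children: Dict[str, Node],
                     parent_fraction: float = 1.0,
                     prefix: str = "") -> Dict[str, float]:
    """Return {path: fraction of the whole cluster} for every leaf group."""
    total = sum(shares for shares, _ in children.values())
    result: Dict[str, float] = {}
    for name, (shares, sub) in children.items():
        # A group's fraction is its share among its siblings,
        # scaled by the fraction its parent received.
        fraction = parent_fraction * shares / total
        path = f"{prefix}/{name}" if prefix else name
        if sub:
            result.update(effective_shares(sub, fraction, path))
        else:
            result[path] = fraction
    return result


# Hypothetical share tree: top-level customers, then groups in one of them.
TREE: Dict[str, Node] = {
    "public": (20, None),
    "grid": (30, None),
    "experiments": (50, {"exp_a": (3, None), "exp_b": (1, None)}),
}

if __name__ == "__main__":
    for path, fraction in effective_shares(TREE).items():
        print(f"{path}: {fraction:.3f}")
```

A real scheduler also weighs these targets against recent usage when ranking jobs. That part is left out here.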

Section 2

Batch Monitoring Tools

Technology Overview

Oracle, Python, Matplotlib & Django → Stats

Cassandra → Fairshare monitoring

OpenTSDB → Live monitoring

Splunk → Historical usage

Live Monitoring with OpenTSDB
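The slide does not show how data reaches OpenTSDB. As one possible sketch, the function below formats a data point as a line for OpenTSDB's telnet-style `put` command (`put <metric> <timestamp> <value> <tag=value ...>`). The metric name and tags are hypothetical, not taken from the CERN setup.

```python
import time
from typing import Dict, Optional, Union


def opentsdb_put_line(metric: str,
                      value: Union[int, float],
                      tags: Dict[str, str],
                      timestamp: Optional[int] = None) -> str:
    """Format one data point for OpenTSDB's telnet-style `put` command."""
    if not tags:
        # OpenTSDB requires at least one tag per data point.
        raise ValueError("at least one tag is required")
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # Sort tags so the same input always yields the same line.
    tag_str = " ".join(f"{key}={val}" for key, val in sorted(tags.items()))
    return f"put {metric} {ts} {value} {tag_str}"


if __name__ == "__main__":
    # Hypothetical metric: number of running jobs in one queue.
    line = opentsdb_put_line("batch.jobs.running", 41250,
                             {"queue": "grid", "cluster": "main"},
                             timestamp=1350000000)
    print(line)
```

Such lines would be written, one per data point, to a socket connected to the time series daemon. The network part is omitted to keep the example self-contained.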

Historical Usage with Splunk

Section 3

Batch Accounting Overhaul

New Batch Accounting: Goals

Make portable to other schedulers

Publish local job information

Publish correct normalisation factor per job

Use the new APEL software

Remove complexity, improve consistency
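Two of these goals, scheduler portability and a per-job normalisation factor, can be pictured with a small sketch. It assumes a scheduler-neutral job record and a per-host factor table. The field names, host names and factor values are all hypothetical, and the factor is taken to be a benchmark-derived multiplier applied to raw CPU time.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JobRecord:
    """Scheduler-neutral accounting record (hypothetical field set)."""
    job_id: str
    scheduler: str      # e.g. "lsf" or "slurm"; only a label here
    host: str
    cpu_seconds: float
    wall_seconds: float
    is_grid: bool       # local jobs are published too, not only grid jobs


# Hypothetical per-host normalisation factors (benchmark-derived).
HOST_FACTORS: Dict[str, float] = {"wn001": 10.0, "wn002": 12.5}


def normalised_cpu_seconds(record: JobRecord,
                           factors: Dict[str, float]) -> float:
    """Scale raw CPU time by the factor of the host the job ran on."""
    try:
        factor = factors[record.host]
    except KeyError:
        # Refuse to publish a job with an unknown factor rather than guess.
        raise KeyError(f"no normalisation factor for host {record.host!r}")
    return record.cpu_seconds * factor


if __name__ == "__main__":
    job = JobRecord("123", "lsf", "wn002", 3600.0, 4000.0, is_grid=False)
    print(normalised_cpu_seconds(job, HOST_FACTORS))
```

Looking up the factor per job, from the host it actually ran on, is what distinguishes this from applying one average factor to the whole cluster.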

Old vs. New Batch Accounting

[Diagram: CEs, BLAH File, LRMS, Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Filter, Local APEL, SSM, Messaging; label: Daily]

Old vs. New Batch Accounting

[Diagram: CEs, BLAH File, LRMS, Acct. File, Acct. Reports, Acct. Page, APEL, Acct. Portal, Filter, Local APEL, SSM, Messaging; label: Real-Time]

Conclusion

We need to scale

We’re moving to new infrastructure tools

CERN batch service being prepared for future challenges

Thanks!

Questions?