CERN - IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it

Monitoring the ATLAS Distributed Data Management System
Ricardo Rocha (CERN)
on behalf of the ARDA Dashboard team
Dec 27, 2015

CHEP2007, Victoria, Canada

Outline

• Dashboard project
• ATLAS DDM system
• DDM Dashboard
• Monitoring for operators
• Monitoring for end users
• Conclusion

Dashboard Project

• Started inside the ARDA group of the EGEE/LCG project in 2005
  – Initially covering only job monitoring for CMS
• Evolved into a Python framework providing a set of flexible tools, allowing coverage of other grid application areas
• The framework consists of a set of different components
  – Data access layer (DAO)
  – Service configuration (agents)
  – Web application
  – Command line tools
  – APIs
• Strong focus on allowing easy access to the information
  – HTTP query interface
  – Output in HTML (web interfaces), but also XML and CSV for integration with external tools
• Applications currently cover job monitoring (for all HEP experiments + VLEMED/Biomed), data management, site efficiency / reliability, and many others
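The HTTP query interface and CSV output listed above lend themselves to scripted access. The following is a minimal Python sketch of that idea, not the dashboard's documented API: the `format` and `name` query parameters, the example host, and the CSV column names are all assumptions made for illustration, and the sample response is invented.

```python
import csv
import io
from urllib.parse import urlencode


def build_query_url(base_url, params, output="csv"):
    """Build a GET URL for a query.

    The 'format' parameter used to select the output type is hypothetical;
    the slides only say that HTML, XML and CSV outputs exist.
    """
    query = dict(params)
    query["format"] = output
    # Sort the parameters so the resulting URL is deterministic.
    return base_url + "?" + urlencode(sorted(query.items()))


def parse_csv_report(text):
    """Parse a CSV body with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


# Invented sample response; the real column names are not given in the slides.
SAMPLE_CSV = "site,files_done,files_failed\nSITE_A,120,3\nSITE_B,45,0\n"

url = build_query_url(
    "http://dashboard.example.org/request.py/site",  # placeholder host
    {"name": "SITE_A"},
)
rows = parse_csv_report(SAMPLE_CSV)
```

An external tool would fetch `url` over HTTP and hand the response body to `parse_csv_report`; the network call is left out here so the sketch runs offline.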

ATLAS DDM System

• Distributing all the data in the ATLAS infrastructure
• Data is organized in datasets (collections of files)
• Users issue subscriptions on these datasets
• Different agents take care of the several tasks required for the successful movement of the data
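The relationship between datasets, files and subscriptions described above can be pictured with a small data model. This is an illustrative Python sketch only; the class names, fields and methods are hypothetical and do not reflect the actual DQ2 schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A named collection of files (identifiers are plain strings here)."""
    name: str
    files: tuple


@dataclass
class Subscription:
    """A request to have a dataset available at a given site."""
    dataset: Dataset
    site: str
    # Files already moved to the site; filled in as transfers finish.
    transferred: set = field(default_factory=set)

    def missing_files(self):
        """Files of the dataset that have not yet arrived at the site."""
        return [f for f in self.dataset.files if f not in self.transferred]

    def is_complete(self):
        """True once every file of the dataset is at the site."""
        return not self.missing_files()
```

In this picture, the agents are whatever process adds entries to `transferred` as files arrive.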

ATLAS DDM System

• Each set of agents serves one site or a group of sites (typically related to each other: a cloud)
• Initial deployment involved the setup of these services on the VO boxes at each Tier1
• Debugging was extremely difficult (still is…)
  – Involved logging into each of the machines
  – And correlating this information…
• Accounting and monitoring were virtually impossible
• A central point where all the information would be made available became vital

DDM Dashboard

• Main focus on ATLAS specific services (DQ2 system), receiving information from the different agents via HTTP callbacks
  – Transfer state changes
  – Dataset complete
  – Transfer complete
  – Transfer / registration errors
• But also on grid fabric services
  – Data management related services up and running
  – Storage space availability
• Data is put together in a structured way
  – Oracle database at CERN
• Different tools (agents) responsible for generating statistics and metrics
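To make the callback flow above more concrete, here is a rough Python sketch of how incoming events could be counted per site and turned into a simple statistic. The event type strings, the event dictionary layout and all function names are assumptions for illustration, and an in-memory dictionary stands in for the Oracle database.

```python
from collections import defaultdict

# Event types mirroring the callback list above; the string values are invented.
EVENT_TYPES = {
    "transfer_state",
    "transfer_complete",
    "dataset_complete",
    "transfer_error",
    "registration_error",
}


def new_store():
    """In-memory stand-in for the central database: per-site event counters."""
    return defaultdict(lambda: defaultdict(int))


def apply_callback(store, event):
    """Record one callback event, given as a dict with 'type' and 'site' keys."""
    if event["type"] not in EVENT_TYPES:
        raise ValueError("unknown event type: %r" % event["type"])
    store[event["site"]][event["type"]] += 1


def site_summary(store, site):
    """Return counts of completed and failed operations for one site."""
    counters = store[site]
    failed = counters["transfer_error"] + counters["registration_error"]
    return {"complete": counters["transfer_complete"], "failed": failed}
```

A statistics agent in this sketch would simply call `site_summary` periodically for each known site.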

DDM Dashboard

• Serves different sets of use cases, coming from different types of users
• Site / system operators
  – “How is the overall system doing?”
  – “How is site X doing?”
  – “What is the most common error, and what is triggering it?”
• End users / production coordinators
  – “What is the status of this (set of) dataset subscription(s)?”
  – “When will the data become available?”
  – Essential to have real time information
• How much data? A lot!
  – Millions of file transfers, each reporting the different steps
    • An average week means 2 million hits (90%+ bulk reports with up to 500 items)
  – Especially critical when systems misbehave (more errors)
  – A lot of work on partitioning the data, optimizing the database and the web server setup (Apache)
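The load figures above can be put into perspective with a little arithmetic, and the bulk reports suggest a simple batching scheme. The Python sketch below shows both under stated assumptions: the function names are hypothetical, the 500-item limit is taken from the slide, and the way agents actually group their reports is not described in the source.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800


def average_hits_per_second(hits_per_week):
    """Average request rate implied by a weekly hit count."""
    return hits_per_week / SECONDS_PER_WEEK


def batch_reports(events, max_items=500):
    """Split a list of events into bulk reports of at most max_items each."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    for start in range(0, len(events), max_items):
        yield events[start:start + max_items]


# 2 million hits per week averages out to a little over 3 hits per second.
rate = average_hits_per_second(2_000_000)

# 1200 pending events become three reports: 500, 500 and 200 items.
sizes = [len(batch) for batch in batch_reports(list(range(1200)))]
```

The average hides the bursts: as the slide notes, the load is highest precisely when systems misbehave and report more errors.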

Monitoring for operators

• “How is the whole system performing?”

Monitoring for operators

• “What is wrong with site X?”

Monitoring for operators

• “What files are causing error ‘…’?”
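The operator questions about errors amount to two aggregations over the reported error events: which error is most frequent, and which files are behind each one. A minimal Python sketch, assuming each error event is a dict with hypothetical `error` and `file` keys (the real record layout is not given in the slides):

```python
from collections import Counter, defaultdict


def most_common_errors(error_events, top=3):
    """Return the `top` most frequent error messages with their counts."""
    counts = Counter(event["error"] for event in error_events)
    return counts.most_common(top)


def files_by_error(error_events):
    """Map each error message to the sorted list of files that triggered it."""
    grouped = defaultdict(set)
    for event in error_events:
        grouped[event["error"]].add(event["file"])
    return {error: sorted(files) for error, files in grouped.items()}
```

In a database-backed setup the same two questions would typically be answered with grouped queries rather than in application code.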

Monitoring for end users

• “I subscribed to dataset X on site Y. What is the status?”
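For the end-user question, a status answer boils down to how much of the dataset has arrived and, ideally, a rough estimate of when the rest will follow. The Python sketch below is a naive illustration: the function name and return fields are hypothetical, and the estimate simply assumes the observed average transfer rate continues, which real transfers rarely do.

```python
def subscription_status(total_files, done_files, elapsed_seconds):
    """Summarize progress of a subscription and naively estimate time left.

    The estimate extrapolates the average rate seen so far; it is None when
    nothing has been transferred yet and no rate can be derived.
    """
    if total_files <= 0:
        raise ValueError("total_files must be positive")
    if not 0 <= done_files <= total_files:
        raise ValueError("done_files must be between 0 and total_files")

    eta_seconds = None
    if done_files == total_files:
        eta_seconds = 0.0
    elif done_files > 0 and elapsed_seconds > 0:
        remaining = total_files - done_files
        # remaining / (done / elapsed), rearranged to avoid a tiny rate value
        eta_seconds = remaining * elapsed_seconds / done_files

    return {
        "percent_done": 100.0 * done_files / total_files,
        "eta_seconds": eta_seconds,
    }
```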

Conclusion

• Essential tool for all ATLAS operations
• Currently used by a large number of people, from those responsible for sites and systems to end users
• Usage goes beyond the web interface
  – Data being queried by different external tools for automating operations (catalog cleanup, consistency checks, alarms and notifications, …)
• What is coming next
  – Integration with the site specific monitoring tools
  – More alarms and notifications
  – Automated reaction to specific events
  – More focus on the end user
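As an illustration of the alarms that external tools can build on the dashboard data, the Python sketch below flags sites whose failure rate exceeds a threshold. The function name, the input layout (site mapped to a pair of completed and failed transfer counts) and both threshold values are assumptions chosen for the example, not settings taken from the dashboard.

```python
def check_alarms(site_stats, max_failure_rate=0.2, min_transfers=10):
    """Return alarm messages for sites with too many failed transfers.

    site_stats maps a site name to a (completed, failed) pair of counts.
    Sites with fewer than min_transfers attempts are skipped to avoid
    raising alarms on very small samples.
    """
    alarms = []
    for site, (completed, failed) in sorted(site_stats.items()):
        total = completed + failed
        if total < min_transfers:
            continue
        rate = failed / total
        if rate > max_failure_rate:
            alarms.append(
                "%s: %.0f%% of %d transfers failed" % (site, 100 * rate, total)
            )
    return alarms
```

A notification tool would run a check like this on a schedule and forward any returned messages to the people responsible for the site.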

• Tier 0 export
  http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site
• Production
  http://dashb-atlas-data.cern.ch/dashboard/request.py/site
• Homepage
  http://dashboard.cern.ch
• Contact
  [email protected]
