Real-time Data Access Monitoring in Distributed, Multi Petabyte Systems
Tofigh Azemoon, Jacek Becla, Andrew Hanushevsky, Massimiliano Turri
SLAC National Accelerator Laboratory
March 6, 2009
Transcript
March 6, 2009 Tofigh Azemoon 1
Real-time Data Access Monitoring in Distributed, Multi Petabyte Systems
Tofigh Azemoon
Jacek Becla
Andrew Hanushevsky
Massimiliano Turri
SLAC National Accelerator Laboratory
March 6, 2009 Tofigh Azemoon 2
Soon a typical running HEP experiment will look like this
100s of users
10s of thousands of batch nodes
1000s of data servers
10s of millions of files, containing 10s of PB of data+
geographically distributed, crossing many time zones.
March 6, 2009 Tofigh Azemoon 3
Mission Statement
Provide real time overall view of system performance
Respond to detailed queries
  to identify bottlenecks
  to optimize the system
  to aid in planning system expansion
March 6, 2009 Tofigh Azemoon 4
The SLAC ¼ PB “kan” Cluster
Clients
kan001 kan002 kan003 kan004 kan059
bbr-rdr03 bbr-rdr04
Data Servers (320 TB)
Managers
March 6, 2009 Tofigh Azemoon 5
Monitored Objects
client
file
job
session
user
type_2
type_1
server
March 6, 2009 Tofigh Azemoon 6
File classes to monitor aggregate values for groups of files
BaBar Examples:
March 6, 2009 Tofigh Azemoon 7
Xrootd Server
Highly scalable server
POSIX-like access to files
Load balancing
Transparent recovery from server crashes
Fault tolerant
Very low latency
March 6, 2009 Tofigh Azemoon 8
Monitoring Implementation in xrootd
Minimal impact on client requests
Robustness in multimode failure
Precision & specificity of collected data
Real time scalability
Use UDP datagrams
Data servers insulated from monitoring. But
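The "Use UDP datagrams" point can be illustrated with a minimal sketch of a fire-and-forget sender. This is not xrootd's actual monitoring code or wire format: the JSON payload, the function names `make_monitor_socket` and `send_event`, and the event fields are all assumptions made for illustration. The idea shown is only that a non-blocking UDP send lets a data server report an event without waiting on, or failing because of, the monitoring collector.

```python
import json
import socket


def make_monitor_socket() -> socket.socket:
    """Create a non-blocking UDP socket for fire-and-forget reporting."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)  # a send must never stall a client request
    return sock


def send_event(sock: socket.socket, collector: tuple, event: dict) -> bool:
    """Try to send one monitoring event to the collector.

    Returns True if the datagram was handed to the OS, False if the send
    failed locally. No acknowledgement is awaited, so a slow or dead
    collector cannot block the caller. The JSON encoding is a
    hypothetical payload format chosen for readability.
    """
    payload = json.dumps(event).encode("utf-8")
    try:
        sock.sendto(payload, collector)
        return True
    except OSError:
        # Includes BlockingIOError: drop the event rather than wait.
        return False
```

A server would call `send_event(sock, ("collector.example.org", 9930), {"kind": "open", "file": "/store/a.root"})` on each file open, where the host, port, and fields are placeholders. The trade-off is that datagrams can be lost, so the collector has to tolerate gaps.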
DB and RT Data Backup
File and User Filtering
Staging Monitoring
  Never enough disk to hold entire data sample
  Disk uses power even when files are not accessed
Fraction of Data Accessed
  For each file type
  In specific time intervals
  …
Multi Experiment Monitoring
  Many experiments sharing computing resources
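The "Fraction of Data Accessed" item (for each file type, in specific time intervals) could be computed along the following lines. This is a sketch under stated assumptions, not the authors' implementation: the `FileInfo` and `AccessEvent` records, the type labels, and the choice to weight files by their full size (rather than by bytes actually read) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    path: str
    file_type: str   # hypothetical label for the file's type
    size_bytes: int


@dataclass(frozen=True)
class AccessEvent:
    path: str
    timestamp: float  # seconds since the epoch


def fraction_accessed(catalog, events, start, end):
    """Return {file_type: fraction of stored bytes touched in [start, end)}.

    A file counts as accessed if at least one event falls in the window;
    its whole size is then counted (an assumption of this sketch).
    """
    accessed = {e.path for e in events if start <= e.timestamp < end}
    total = defaultdict(int)
    touched = defaultdict(int)
    for f in catalog:
        total[f.file_type] += f.size_bytes
        if f.path in accessed:
            touched[f.file_type] += f.size_bytes
    return {t: touched[t] / total[t] for t in total if total[t] > 0}
```

File types that are rarely touched in a given window are candidates for leaving on tape or on powered-down disk, which ties this measure to the staging concerns listed above.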