Overview of the Computer Resource Team (CRT) Blaise Barney (LLNL) Rob Cunningham (LANL) Barbara Jennings (Sandia) PSAAP Kickoff Meeting July 8, 2008 Albuquerque, NM LLNL-PRES-405061
What Is The CRT?
The Computer Resource Team (CRT) is the component of the PSAAP program that connects Alliance researchers to the High Performance Computing (HPC) resources required to perform their work
The CRT consists of one representative from each NNSA Lab who is familiar with that Lab's computing resources, personnel and policies. The following individuals serve on the CRT:
• Blaise Barney, LLNL
• Rob Cunningham, LANL
• Barbara Jennings, SNL
Our primary purpose is to provide assistance and guidance in all aspects related to the use of HPC resources located at LANL, LLNL, Sandia (and SDSC)
What Does The CRT Do For You?
Assist with the establishment and use of computer accounts
Assist with accessing compute resources
Provide essential HPC user documentation
Provide technical support and referral to in-depth consulting
Conduct monthly telecons to keep Alliance users up-to-date with account, access, policy, scheduling and technical issues, and to address issues with HPC platform usage
Interface with other individuals and groups within the Labs, such as management, networking, system administration, storage, customer support, etc., to facilitate the effective support of Alliance users
Track and facilitate the resolution of problems reported to each Lab's customer support “hotline”
Provide training opportunities
Collect and distribute monthly machine usage statistics
Schedule and support special/dedicated runs
Maintain a balance of machine usage between the Alliances
Conduct annual Alliance visits to discuss HPC resources, user issues and to offer technical consultation and/or training
Showcase Alliance research in the NNSA/ASC research exhibit booth at the annual SC conference
HPC Compute Resources Available To The Alliances
Computer Accounts
Alliances need at least one account authorizer. This can be a PI, POC and/or a trustworthy, knowledgeable designee
Account authorizers are responsible for overseeing the accounts and machine usage for all of their Center's users
Each Lab has its own policies, forms and procedures; however, there is a single entry portal (sarape.sandia.gov) for requesting an account at any of the 3 labs
Account processing for non-US citizens requires additional time and “paperwork” - allow 30-90 days (plan ahead)
Having a “backup” authorizer is important if the primary authorizer is often not available
The CRT has sent all PSAAP POCs and PIs “quick sheets” for getting started with account requests and account management.
Questions? Contact your CRT representative (depends upon the Lab where the account is requested)
Computer Access
To access any machine, you must first have an account on that machine
As with accounts, each lab has its own access policies and procedures
All 3 labs require a valid computer account, ssh and use of a password generating device (cryptocard / one-time token), which is sent to you after your initial account request is approved
Additionally, LLNL requires remote users to access resources through VPN (virtual private network):
• Makes your local machine appear to be on the LLNL network
• VPN accounts are included with original account applications
• Requires a one-time software download, install and config - or - simply connect via a web interface
User Documentation
Most of what users need to know is available online via web pages hosted by each of the labs. Recommended starting points:
• LLNL
– computing.llnl.gov
– computing.llnl.gov/tutorials/lc_resources
• LANL
– computing-int.lanl.gov
– int.lanl.gov/projects/asci/training/Intro
• Sandia
– hpc.sandia.gov
– clik.sandia.gov
• SDSC
– www.sdsc.edu/us
Access to this information varies:
• LLNL, SDSC: most web pages are open – no authentication required
• Sandia, LANL: most web pages require authentication (need an account set up first)
HPC Training
Training is important – especially for new users
Online tutorials are available (see previous User Documentation links)
Workshops conducted at the Labs are open to Alliance users
Training delivered at your Center or over the Access Grid is also possible
Topics include:
• Getting Started Information
• Compilers
• Performance tools, Optimization
• Debuggers
• Parallel programming (MPI, OpenMP, Pthreads…)
• Batch schedulers
• Architectures (Purple, Redstorm, TLCC, etc.)
• Visualization tools
Topic specific, customized training? The CRT can assist here too.
Customer Support and Problem Tracking
Customer support for technical and accounting issues is available via phone and email during normal business hours:
Problems and questions are tracked via a customer support database application (varies with each Lab).
Most problems/questions are handled via “Tier 1” support – the “hotline” at each Lab.
More in-depth issues are typically referred to local “Tier 2” support – a specialist.
The labs coordinate with hardware and software vendors for issues that require outside “Tier 3” support.
Off-hours support handled by Operations staff
CRT reps coordinate regularly with each other on Tri-lab user issues.
Dedicated Runs (DATs)
Normally, Alliance users share machine usage with other users - jobs are typically submitted to a batch system, queued, and wait their turn for execution.
Additionally, there are limits on the number of nodes and number of hours that a job can use.
Exclusive use of a machine (dedicated application time - DAT) can be requested by any Alliance. For example, at LLNL:
• Most weekends are dedicated to Alliance use of the ALC and UP clusters
• Normal node/time limits are not in effect
• No other user jobs are run - only those of the scheduled Alliance(s)
How to request a DAT:
• LLNL: computing.llnl.gov/forms/ASC_dat_form.html
• LANL: email to [email protected]
• Sandia: email to [email protected]
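The difference between normal shared use and a DAT can be illustrated with a minimal Python sketch. The limit values, the `JobRequest` type and the `is_allowed` function are hypothetical and exist only for illustration; actual node and time limits vary by machine and by Lab.

```python
from dataclasses import dataclass

# Hypothetical limits for illustration only; real limits differ per machine and Lab.
MAX_NODES = 256
MAX_HOURS = 12.0


@dataclass
class JobRequest:
    nodes: int
    hours: float


def is_allowed(job: JobRequest, dedicated: bool = False) -> bool:
    """Return True if the job fits the (hypothetical) limits.

    During a DAT, the normal node/time limits are not in effect.
    """
    if dedicated:
        return True
    return job.nodes <= MAX_NODES and job.hours <= MAX_HOURS


big_job = JobRequest(nodes=1024, hours=48.0)
print(is_allowed(big_job))                  # normal shared use
print(is_allowed(big_job, dedicated=True))  # during a DAT
```

A job that exceeds the normal limits is the typical reason to request a DAT.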
Communications
Monthly telecons and email list ([email protected])
• Active participation by all 8 Alliances, LLNL, LANL, Sandia and SDSC
• Forum for discussion/questions on user topics such as accounts, access, technical issues, machine schedules, etc.
• First Wed each month, 1:00pm Pacific time
• Toll-free number hosted by the CRT: 866-914-3976 code: 187522#
• Minutes are distributed via our email list to all Alliances, ASC HQ and various staff & managers within the Labs
• Let us know if you want anyone else at your Center added to our list - initially it includes only your POC and PI
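For planning purposes, the "first Wednesday of each month" rule can be computed with the Python standard library. This is a small sketch; the function name `first_wednesday` is an illustrative choice and not something provided by the CRT.

```python
import calendar
from datetime import date


def first_wednesday(year: int, month: int) -> date:
    """Return the date of the first Wednesday in the given month."""
    # A Calendar with an explicit Monday start keeps the weekday index stable.
    cal = calendar.Calendar(firstweekday=calendar.MONDAY)
    for week in cal.monthdayscalendar(year, month):
        day = week[calendar.WEDNESDAY]  # 0 means the day belongs to another month
        if day:
            return date(year, month, day)
    raise ValueError("month has no Wednesday")  # unreachable for valid months


print(first_wednesday(2008, 8))
```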
Usage stats
• Collected by the CRT and distributed with the telecon minutes
• Present both aggregate and detailed usage (down to the user level) for each Lab (and SDSC).
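The relationship between aggregate and user-level usage can be sketched in Python as follows. The record layout, the Alliance and user names, and the node-hour figures are all made up for illustration; the actual reports and their format are produced by each Lab.

```python
from collections import defaultdict

# Hypothetical records: (alliance, user, node_hours). Names and numbers are invented.
records = [
    ("Alliance A", "user1", 120.0),
    ("Alliance A", "user2", 30.5),
    ("Alliance B", "user3", 75.0),
    ("Alliance A", "user1", 10.0),
]


def summarize(rows):
    """Return (per-Alliance totals, per-user totals) in node-hours."""
    per_alliance = defaultdict(float)
    per_user = defaultdict(float)
    for alliance, user, node_hours in rows:
        per_alliance[alliance] += node_hours
        per_user[(alliance, user)] += node_hours
    return dict(per_alliance), dict(per_user)


per_alliance, per_user = summarize(records)
for alliance, total in sorted(per_alliance.items()):
    print(f"{alliance}: {total:.1f} node-hours")
```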
Communications
Email & phone
• Customer support staff at each lab are available for assistance and are also active in sending out important machine/network status notices.
• The CRT can be contacted directly by any of your Center's users:
– Blaise Barney (LLNL) [email protected] 925-422-2578
– Rob Cunningham (LANL) [email protected] 505-665-4444 x05704
– Barbara Jennings (Sandia) [email protected] 505-845-8554
Visits
• Annual visits (2-4 hrs) to the Alliances by the CRT and Lab customer support staff:
– Focus is on the Alliance users of HPC computing resources
– Updates on architectures, policies, future plans at the Labs
– Forum for discussing user issues, problems, questions
– We can include technical "training" sessions also if desired
• We'll be contacting you soon to set up an initial visit - after your users have accounts - possibly Sep-Oct time frame?
Questions?