Introduction to HPC and the National CyberInfrastructure
Ritu Arora
Email: [email protected]
October 27, 2014
Objectives
• Provide a basic introduction to HPC
• Introduce the audience to XSEDE (the national CyberInfrastructure)
• Introduce the audience to TACC resources
• Provide basic information on using TACC resources
• Time is too short for in-depth coverage
• However, the knowledge imparted will be sufficient to get you started conducting your data-management activities on national open-science resources
(Image slide; source: www.nature.com)
What is HPC?
• High Performance Computing (HPC) is the use of parallel processing techniques to enable larger computations in shorter turnaround times than your laptop or desktop allows
• HPC systems, also known as supercomputers, currently operate in the petaFLOPS range
• It can be expensive to buy your own high-end HPC system and to pay for its operation and maintenance
• Good news: you can get access to HPC and high-end storage systems at no direct cost to you through XSEDE
  – More on XSEDE two slides ahead
1 petaFLOP (PF) = 1 quadrillion math operations per second
Serial Versus Parallel
• One job, one processor: a serial solution
• One job, several processors: a parallel solution
• Parallel programming usually works by breaking a problem into pieces and working on those pieces at the same time (see the sketch below)
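As a minimal shell sketch of this idea (not from the original slides; process_chunk.sh and the chunk names are hypothetical placeholders):

```bash
#!/bin/bash
# One job, one processor: process the pieces one after another (serial).
# One job, several processors: process the pieces at the same time (parallel).
# (process_chunk.sh and the chunk names are hypothetical placeholders.)
for chunk in chunk1 chunk2 chunk3 chunk4; do
    ./process_chunk.sh "$chunk" &   # launch each piece as its own process
done
wait                                # block until every piece has finished
```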
High Throughput Computing
• High Throughput Computing (HTC) consists of running many jobs that are typically similar and not highly parallel (no communication is needed between the different instances of the program that are running simultaneously)
• A common example is running a parameter sweep, where the same program is run with varying inputs, resulting in hundreds or thousands of executions of the program (a sketch follows below)
• We will be doing HTC in today’s afternoon session
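As a hedged sketch of such a parameter sweep (./simulate, its --temp flag, and the parameter range are hypothetical placeholders):

```bash
#!/bin/bash
# Run the same (hypothetical) program once per input value; the instances are
# independent, so no communication is needed between them.
for temp in $(seq 100 10 500); do
    ./simulate --temp "$temp" > "result_${temp}.txt"
done
# On an HTC system, each iteration would typically be submitted as its own
# job so that hundreds of instances can run simultaneously.
```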
XSEDE: Extreme Science and Engineering Discovery Environment
What can you get from XSEDE?
• XSEDE is composed of multiple partner institutions known as Service Providers (SPs), each of which contributes allocatable services
• Resources include HPC machines, High Throughput Computing (HTC) machines, visualization, data storage, testbeds, & services
– https://www.xsede.org/web/guest/resources/overview
• Extended Collaborative Support Service (ECSS) through which researchers can request to be paired with expert staff members for an extended period (weeks up to a year).
– ECSS staff provide expertise in many areas of advanced cyberinfrastructure (CI) and can work with a research team to advance its work through that knowledge
• Training and Education Programs
TACC: Texas Advanced Computing Center
TACC Resources
• HPC, HTC, and Data Analysis Platforms
  – Stampede: 6400+ nodes, 10 PFLOPs
  – Lonestar: 1800+ nodes, 302 TFLOPs
  – Wrangler: data-intensive science (01/2015)
• Scientific Visualization and Analysis Resources
  – Maverick: visualization and analysis, 132 K40 GPUs
  – Vis Lab: 12.4-megapixel collaborative touch display
• Data Services
  – Ranch: tape archive, 100 PB
  – Corral: data storage and sharing, 6 PB storage
  – Rodeo: cloud services, user VMs
Stampede HPC System – You will use it today!
• Stampede is one of the world’s most powerful supercomputers
• About 6400 compute nodes and additional specialized nodes
• Each node has multiple cores: 522,080 processing cores in total
• 270 TB of total memory
  – Each typical compute node on Stampede has 32 GB of memory (RAM)
  – There are 16 specialized nodes, called large-memory nodes, each having 1 TB of RAM
  – Note 1: a typical desktop computer has 4, 8, or maybe 16 GB of RAM
  – Note 2: more memory means that researchers can work on problem sizes much larger than they otherwise could on a desktop computer
• 14 PB of high-performance storage
• 75 miles of network cable
A node on Stampede
(Photo: Mike Packard, TACC Senior Systems Administrator, slides one of Stampede's 6,400 nodes into its cabinet during installation. Callouts:)
1. InfiniBand HCA card
2. Two 8-core Intel Xeon processors
3. Memory
4. Storage/file system
5. Space for future expandability using coprocessors or accelerators
6. Intel Xeon Phi Coprocessor
Oversimplified Diagram – Accessing Stampede
(Diagram: users connect via SSH to one of four login nodes, login1 through login4. A job scheduler sits between the login nodes and the compute nodes. An interconnect links the typical compute nodes (e.g., C201-231, …), the specialized compute nodes (e.g., large-memory nodes and visualization nodes), and the file systems: $HOME, $WORK, $SCRATCH.)
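For example, connecting from your own machine (replace the placeholder username with your TACC account name):

```bash
# SSH to Stampede; a load balancer assigns you one of the login nodes.
ssh [email protected]
```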
Login Nodes on Stampede
Component: Description
• 4 login nodes: stampede.tacc.utexas.edu
• Processors: Intel E5-2680, 2.7 GHz
• Sockets per node / cores per socket: 2/8
• Motherboard: Dell R720, Intel QPI C600 chipset
• Memory per node: 32 GB
• $HOME (default directory upon login): Lustre, 5 GB quota
Use the login nodes for installing software, compiling your programs, editing files, transferring files, submitting or monitoring batch jobs, starting an interactive session, and running additional lightweight processes; a few of these activities are sketched below.
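A hedged sketch of typical login-node activities (the module name, source file, and job-script name are illustrative assumptions):

```bash
# Load a compiler toolchain and build a small program on a login node.
module load intel
icc -O2 -o myprog myprog.c

# Hand the real work to the batch scheduler instead of running it here.
sbatch myjob.slurm    # submit a batch job to run on compute nodes
squeue -u $USER       # monitor your queued and running jobs
```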
Compute Nodes on Stampede
• The majority of the 6400 nodes are configured with two Xeon E5-2680 processors and one Intel Xeon Phi SE10P Coprocessor
• These compute nodes are configured with 32GB of "host" memory with an additional 8GB of memory on the Xeon Phi Coprocessor card
• A smaller number of compute nodes are configured with two Xeon Phi Coprocessors
Use the compute nodes for running any batch or interactive jobs that would take more than a few seconds to complete.
Stampede File-Systems
• User-owned storage on the Stampede system is available in three directories that are identified by the $HOME, $WORK, and $SCRATCH environment variables
• These directories are separate file systems, and accessible from any node in the system
• $HOME: 5 GB quota, maximum 150K files allowed; backed up; no purge policy. Store your source code and build your software here.
• $WORK: 400 GB quota, maximum 3M files allowed; not backed up; no purge policy. Store large files here.
• $SCRATCH: no quota restriction; not backed up; files with access times of greater than 10 days can be purged. Store large files here.
The parallel file system (named Lustre) makes hundreds of spinning disks act like a single disk.
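A small sketch of moving between these file systems from any Stampede node (the environment variables resolve per user):

```bash
echo $HOME $WORK $SCRATCH   # see where each file system lives for your account
cd $SCRATCH                 # large, temporary job data goes here (purgeable)
cd $WORK                    # larger quota; not backed up, but not purged
cd $HOME                    # small, backed-up quota: source code and builds
```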
Stampede’s Archival Storage System
• Stampede's tape-based archival storage system is Ranch
• 60 PB capacity, not backed up, not replicated
• Ranch (ranch.tacc.utexas.edu) is accessible from Stampede via the $ARCHIVER and $ARCHIVE variables
– Store permanent files here for archival storage
– This file system is NOT mounted (directly accessible) on any node
– Use it only for long-term file storage
– You need to stage the data back to Stampede for any usage – cannot directly access the data on Ranch through your jobs
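A hedged sketch of using these variables to archive and later retrieve a file (the tar-file name is a placeholder; check the Ranch documentation for current usage):

```bash
# Push a tar file from Stampede to your archive space on Ranch.
scp mydata.tar ${ARCHIVER}:${ARCHIVE}/

# Stage it back to $SCRATCH before using it; jobs cannot read Ranch directly.
scp ${ARCHIVER}:${ARCHIVE}/mydata.tar $SCRATCH/
```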
How to Get Started?
How do you get Started?
• In order to use TACC resources (or additional XSEDE resources) and request an expert's help:
1. Create an XSEDE portal account and a TACC portal account
   – Note: in order to activate your TACC account, you will need to log into the TACC portal once after getting an email from the TACC user services group
2. PIs should then submit a request for a start-up allocation (computing hours) and ECSS staff through the XSEDE portal
3. Once the allocation request is approved, the project PI can add his or her group members (with active portal accounts) to the allocation by logging in to the XSEDE portal
4. Once you have resources and experts assigned to your project through XSEDE, you can use the credentials of your TACC portal account to log directly into TACC resources to do your research and development work; you may also use the XSEDE Single Sign-On Login Hub
With the training accounts that you will receive at this workshop, Steps 1 to 3 mentioned above can be temporarily ignored. In the next session you will use the training accounts to connect to TACC resources.
After you are connected
• Once you are connected to a TACC (or another XSEDE) resource remotely, you will need to understand the user environment on those resources
  – Linux OS (to be discussed in the next session)
  – Usage policies: the resources are shared with other users, so understand the resource-usage policies (a slide on this later)
  – Bring your data
  – Do software installation in your account if what you need is not already available on Stampede
  – Do your processing or post-processing
  – Move the results to secondary or tertiary storage media at TACC or at your institution (a data-transfer sketch follows this list)
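A hedged sketch of bringing data in and moving results out with standard tools (run from your local machine; the username and directory names are placeholders):

```bash
# Bring input data from your local machine to your $HOME on Stampede.
scp -r mydata [email protected]:mydata

# After processing, pull the results back to your institution.
scp -r [email protected]:results ./results
```

Large data sets belong under $WORK or $SCRATCH rather than $HOME.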
Some of the activities that you should avoid on shared resources like Stampede
• Avoid running time-consuming jobs (programs or scripts) on the login nodes
  – All such jobs should be run on the compute nodes
  – Compute nodes can be accessed via a batch job or interactively (more on this in the afternoon session; a minimal batch-script sketch follows this list)
• Avoid running large jobs from $HOME
  – Run such jobs from $SCRATCH
• Avoid running more than 2 (or 3) rsync processes simultaneously for data transfer
• Avoid parking your data for months on $SCRATCH without accessing it periodically
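A minimal batch-script sketch for moving work onto the compute nodes, assuming Stampede's SLURM scheduler (the queue name, resource requests, and program are illustrative; consult the user guide for current options):

```bash
#!/bin/bash
#SBATCH -J myjob            # job name (illustrative)
#SBATCH -o myjob.%j.out     # stdout file; %j expands to the job ID
#SBATCH -p normal           # queue (partition) to submit to
#SBATCH -N 1                # number of nodes requested
#SBATCH -n 16               # total number of tasks
#SBATCH -t 00:30:00         # wall-clock time limit (hh:mm:ss)

cd $SCRATCH                 # run large jobs from $SCRATCH, not $HOME
./myprog                    # placeholder for your program
```

Submit it from a login node with sbatch myjob.slurm; the scheduler then runs it on compute nodes when resources become available.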
For Further Information
• TACC resources
– check out the resource-specific user guides on the TACC website; for example, below is the link to the Stampede user guide
https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide
– Submit tickets through the TACC portal https://portal.tacc.utexas.edu/
• XSEDE resources
– Visit the XSEDE website https://www.xsede.org/
– Submit tickets through the XSEDE portal https://portal.xsede.org/#/guest
Thanks for listening!
Any questions or comments?