Transcript
Page 1:

Purdue Campus Grid

Preston Smith

psmith@purdue.edu

Condor Week 2006

April 24, 2006

Page 2:

Overview

• RCAC
  – Community Clusters
• Grids at Purdue
  – Campus
  – Regional
    • NWICG
  – National
    • OSG
    • CMS Tier-2
    • NanoHUB
    • Teragrid

• Future Work

Page 3:

Purdue’s RCAC

• Rosen Center for Advanced Computing
  – Division of Information Technology at Purdue (ITaP)
  – Wide variety of systems: shared memory and clusters
    • 352-CPU IBM SP
    • Five 24-processor Sun F6800s, two 56-processor Sun E10Ks
    • Five Linux clusters

Page 4:

Linux clusters in RCAC

• Recycled clusters
  – Systems retired from student labs
  – Nearly 1000 nodes of single-CPU PIII and P4, and 2-CPU Athlon MP and EM64T Xeon systems for general use by Purdue researchers

Page 5:

Community Clusters

• Federate resources at a low level

• Separate researchers buy sets of nodes to federate into larger clusters
  – Enables larger clusters than a scientist could support on his own
  – Leverage central staff and infrastructure
    • No need to sacrifice a grad student to be a sysadmin!

Page 6:

Community Clusters

Hamlet
  – 308 nodes, dual Xeon (3.6 Tflops)
  – 3.06 GHz to 3.2 GHz
  – 2 GB and 4 GB RAM
  – GigE, Infiniband
  – 5 owners (EAS, BIO x2, CMS, EE)

Macbeth
  – 126 nodes, dual Opteron (~1 Tflops)
  – 1.8 GHz
  – 4-16 GB RAM
  – Infiniband, GigE for IP traffic
  – 7 owners (ME, Biology, HEP Theory)

Lear
  – 512 nodes, dual 64-bit Xeon (6.4 Tflops)
  – 3.2 GHz
  – 4 GB and 6 GB RAM
  – GigE
  – 6 owners (EE x2, CMS, Provost, VPR, Teragrid)

Page 7:

Community Clusters

• Primarily scheduled with PBS
  – Contributing researchers are assigned a queue that can run as many “slots” as they have contributed.
• Condor co-schedules alongside PBS
  – When PBS is not running a job, a node is fair game for Condor!
    • But Condor work is subject to preemption if PBS assigns work to the node.
  – The PBS prologue/epilogue scripts and Condor START expression that implement this appear on pages 22-23.

Page 8:

Condor on Community Clusters

• All in all, Condor joins together 4 clusters (~2500 CPU) within RCAC.

Page 9:

Grids at Purdue - Campus

• Instructional computing group manages a 1300-node Windows Condor pool to support instruction.
  – Mostly used by computer graphics classes for rendering animations
    • Maya, etc.
  – Work in progress to connect the Windows pool with the RCAC pools.

Page 10:

Grids at Purdue - Campus

• Condor pools around campus
  – Physics department: 100 nodes, flocked
  – Envision Center: 48 nodes, flocked (a minimal flocking sketch follows below)
• Potential collaborations
  – Libraries: ~200 nodes on Windows terminals
  – Colleges of Engineering: 400 nodes in existing pool
  – Or any department interested in sharing cycles!
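The flocked pools above use Condor's standard flocking mechanism. Below is a minimal sketch of how a departmental pool and the RCAC pools might be wired together; the hostnames are hypothetical placeholders, not Purdue's actual machines, and a production setup would also need matching security settings on every daemon involved.

Flocking configuration sketch (condor_config.local, illustrative)

# On the departmental submit machine: jobs that cannot be matched
# locally are allowed to flock to the RCAC pool's central manager.
FLOCK_TO = cm.rcac.example.edu

# On the RCAC pool (central manager and execute nodes): accept
# flocked jobs from the departmental submit machine.
FLOCK_FROM = submit.physics.example.edu
HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), $(FLOCK_FROM)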

Page 11:

Grids at Purdue - Regional

• Northwest Indiana Computational Grid
  – Purdue West Lafayette
  – Purdue Calumet
  – Notre Dame
  – Argonne Labs

• Condor pools available to NWICG today.

• Partnership with OSG?


Page 12:

Open Science Grid

• Purdue active in Open Science Grid
  – CMS Tier-2 Center
  – NanoHUB
  – OSG/Teragrid Interoperability
• Campus Condor pools accessible to OSG
  – Condor used for access to extra, non-dedicated cycles for CMS and is becoming the preferred interface for non-CMS VOs.


Page 13:

CMS Tier-2 - Condor

• MC production from UW-HEP ran this spring on RCAC Condor pools.
  – Processed 23% or so of the entire production.
  – High rates of preemption, but that’s expected!
• 2006 will see the addition of dedicated Condor worker nodes to the Tier-2, in addition to PBS clusters.
  – Condor running on resilient dCache nodes.

Page 14:

NanoHUB

[Diagram: nanoHUB VO architecture, with the science gateway, workspaces, research apps, and middleware on top of virtual backends (VMs, a virtual cluster with VIOLIN), campus grids (Purdue, GLOW), and the Grid, spanning capacity and capability computing.]

Page 15:

Teragrid

• Teragrid Resource Provider

• Resources offered to Teragrid
  – Lear cluster
  – Condor pools
  – Data collections

Page 16:

Teragrid

• Two current projects active in Condor pools via Teragrid allocations
  – Database of Hypothetical Zeolite Structures
  – CDF Electroweak MC Simulation
    • Condor-G Glide-in (a sketch of a basic Condor-G submission appears below)
    • Great exercise in OSG/TG Interoperability
  – Identifying other potential users
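For context on the Condor-G glide-in item, here is a minimal grid-universe (Condor-G) submit description of the kind that glide-in builds on. The gatekeeper address, jobmanager, and file names are hypothetical placeholders, not the actual CDF production setup; glide-in itself submits Condor daemons through this same mechanism so that ordinary Condor jobs can then run on the remote resource.

Condor-G submit description (illustrative)

# Route the job to a (hypothetical) Globus gatekeeper's PBS jobmanager
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
executable    = cdf_mc.sh
output        = cdf_mc.out
error         = cdf_mc.err
log           = cdf_mc.log
queue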

Page 17:

Teragrid

• TeraDRE - Distributed Rendering on the Teragrid
  – Globus, Condor, and IBRIX FusionFS enable Purdue’s Teragrid site to serve as a render farm
    • Maya and other renderers available (an illustrative render-job sketch follows below)
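As a rough illustration of the render-farm idea (not TeraDRE's actual submission path), a single frame could be described to Condor as an ordinary vanilla-universe job. The renderer path, scene file, and frame arguments below are placeholders:

Render-job submit description (illustrative)

universe    = vanilla
# Assume the renderer is already installed on the execute node
executable  = /usr/local/maya/bin/Render
transfer_executable = false
arguments   = -s 101 -e 101 scene.mb
transfer_input_files    = scene.mb
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output      = frame101.out
error       = frame101.err
log         = render.log
queue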

Page 18:

Grid Interoperability

[Diagram: the “Lear” cluster]

Page 19:

Grid Interoperability

• Tier-2 to Tier-2 connectivity via dedicated Teragrid WAN (UCSD->Purdue)

• Aggregating resources at a low level makes interoperability easier!
  – OSG stack available to TG users and vice versa

• “Bouncer” Globus job forwarder

Page 20:

Future of Condor at Purdue

• Add resources
  – Continue growth around campus
    • RCAC
    • Other departments
• Add Condor capabilities to resources
  – Teragrid data portal adding on-demand processing with Condor now
• Federation
  – Aggregate Condor pools with other institutions?

Page 21:

Condor at Purdue

• Questions?

Page 22:

PBS/Condor Interaction

PBS Prologue

# Prevent new Condor jobs and push any existing ones off
/opt/condor/bin/condor_config_val -rset -startd \
    PBSRunning=True > /dev/null
/opt/condor/sbin/condor_reconfig -startd > /dev/null

# If Condor still has a claimed slot on this node, evict its job
if ( condor_status -claimed -direct $(hostname) 2>/dev/null \
     | grep -q Machines )
then
    condor_vacate > /dev/null
    sleep 5
fi

Page 23:

PBS/Condor Interaction

PBS Epilogue

# Allow Condor jobs again now that the PBS job has finished
/opt/condor/bin/condor_config_val -rset -startd \
    PBSRunning=False > /dev/null
/opt/condor/sbin/condor_reconfig -startd > /dev/null

Condor START Expression in condor_config.local

PBSRunning = False
# Only start jobs if PBS is not currently running a job
PURDUE_RCAC_START_NOPBS = ( $(PBSRunning) == False )

START = $(START) && $(PURDUE_RCAC_START_NOPBS)
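For reference, with a Torque/OpenPBS-style pbs_mom these hooks are typically installed as mom_priv/prologue and mom_priv/epilogue on each worker node; the exact hook mechanism depends on the local PBS installation. Note that the condor_config.local snippet extends the node's existing START expression rather than replacing it, so any other local start conditions remain in effect.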