Implementing a Central Quill Database in a Large Condor Installation Preston Smith [email protected] Condor Week 2008 - April 30, 2008

Implementing a Central Quill Database in a Large Condor Installation

Preston Smith [email protected]

Condor Week 2008 - April 30, 2008


• Background – BoilerGrid

• Motivation

• What works well

• What has been challenging

• What just doesn’t work

• Future directions

Overview


• Purdue Condor Grid (BoilerGrid)
  – Comprised of Linux HPC clusters, student labs, machines from academic departments, and Purdue regional campuses

• 8,900 batch slots today.. 14,000 batch slots in a few weeks

• 2007 - Delivered over 10 million CPU-hours of high-throughput science to Purdue and the national community through Open Science Grid and TeraGrid

BoilerGrid


BoilerGrid - Growth


BoilerGrid - Results


• As of Condor 6.9.4:
  – Quill can store information about all the execute machines and daemons in a pool
  – Quill is now able to store job history and queue contents in a single, central database

• Since December 2007, we’ve been working to store the state of BoilerGrid in a Quill installation

A Central Quill Database
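For reference, pointing a pool at a central Quill database is mostly a condor_config exercise. A minimal sketch using the Quill macro names from the Condor 7.0 manual — the host name and values here are illustrative, not our production settings:

```
# Illustrative condor_config fragment for a central Quill database
QUILL_ENABLED     = TRUE
QUILL_NAME        = quill@central-db.example.edu   # hypothetical host
QUILL_DB_TYPE     = PGSQL
QUILL_DB_NAME     = quill
QUILL_DB_IP_ADDR  = central-db.example.edu:5432
QUILL_USE_SQL_LOG = TRUE
```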


• Why would we want to do such a thing??
  – Research into the state of a large distributed system
    • Several researchers at Purdue, collaborators at Notre Dame
  – Failure analysis/prediction, smart scheduling, interesting reporting for machine owners
  – “events” table useful for user troubleshooting?
  – And one of our familiar gripes - usage reporting
    • Structural biologists (see earlier today) like to submit jobs from their desks, too
    • How can we access that job history to complete the picture of BoilerGrid’s usage?

Motivation


• Dell 2850
  – 2x 2.8GHz Xeons (hyperthreaded)
  – 5GB RAM
  – Postgres on 4-disk Ultra320 SCSI RAID-0

The Quill Server


• Getting at usage data!

What works well

quill=> select distinct scheddname, owner, cluster_id, proc_id, remotewallclocktime
quill->   from jobs_horizontal_history
quill->   where scheddname LIKE '%bio.purdue.edu%' LIMIT 10;
       scheddname       |  owner  | cluster_id | proc_id | remotewallclocktime
------------------------+---------+------------+---------+---------------------
 epsilon.bio.purdue.edu | jiang12 |     276189 |       0 |                 345
 epsilon.bio.purdue.edu | jiang12 |     280668 |       0 |                4456
 epsilon.bio.purdue.edu | jiang12 |     280707 |       0 |                1209
 epsilon.bio.purdue.edu | jiang12 |     280710 |       0 |                1197
 epsilon.bio.purdue.edu | jiang12 |     280715 |       0 |                1064
 epsilon.bio.purdue.edu | jiang12 |     280717 |       0 |                 567
 epsilon.bio.purdue.edu | jiang12 |     280718 |       0 |                 485
 epsilon.bio.purdue.edu | jiang12 |     280720 |       0 |                 480
 epsilon.bio.purdue.edu | jiang12 |     280721 |       0 |                 509
 epsilon.bio.purdue.edu | jiang12 |     280722 |       0 |                 539
(10 rows)
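With job history in one place, the usage reporting described in the motivation reduces to simple aggregation. A hedged sketch against the same jobs_horizontal_history columns used above, assuming remotewallclocktime is in seconds:

```sql
-- Total CPU-hours delivered per user (columns as in the query above)
SELECT owner,
       SUM(remotewallclocktime) / 3600.0 AS cpu_hours,
       COUNT(*)                          AS jobs
FROM jobs_horizontal_history
GROUP BY owner
ORDER BY cpu_hours DESC
LIMIT 20;
```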


• Thousands of hosts pounding a Postgres database is non-trivial
  – Be sure to turn down QUILL_POLLING_PERIOD
    • Default is 10s - we went down to 1 hour on execute machines
  – At some level, this is an exercise in tuning your Postgres server

• Quick diversion into Postgres tuning 101..

What works, but is painful

top - 13:45:30 up 23 days, 19:59, 2 users, load average: 563.79, 471.50, 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem:  5079368k total, 5042452k used, 36916k free, 10820k buffers
Swap: 4016200k total, 68292k used, 3947908k free, 2857076k cached
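The polling-period change is a one-line condor_config edit; a sketch (3600 is the value we chose for execute machines, not a general recommendation):

```
# condor_config on execute machines: poll once an hour instead of every 10s
QUILL_POLLING_PERIOD = 3600
```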


• Assuming that there’s enough disk bandwidth….
  – In order to support 2500 simultaneous connections, one must turn up max_connections
  – If you turn up max_connections, you need ~400 bytes of shared memory per slot
    • Currently we have 2G of shared memory allocated

Postgres
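Concretely, the settings above live in postgresql.conf and the kernel. A sketch with illustrative values, sized for roughly 2500 Quill clients — tune for your own pool:

```
# postgresql.conf
max_connections = 2500

# Linux kernel: allow a large enough shared memory segment (here, 2 GB)
#   sysctl -w kernel.shmmax=2147483648
```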


• Then you’ll need to turn up shared_buffers
  – 1G currently
  – Don’t forget max_fsm_pages…

Postgres

WARNING:  relation "public.machines_vertical_history" contains more than "max_fsm_pages" pages with useful free space
HINT:  Consider compacting this relation or increasing the configuration parameter "max_fsm_pages".
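Both knobs on this slide are also postgresql.conf settings; a sketch with illustrative values matching the numbers above:

```
# postgresql.conf
shared_buffers = 1GB        # our current setting; needs matching kernel shm
max_fsm_pages  = 2000000    # illustrative - raise until VACUUM stops warning
```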


• So by now we can withstand the worker nodes reasonably well

• Add schedds
  – condor_history returns history from ALL schedds
    • Bug fixed in 7.0.2
  – The execute machines create enough load that condor_q is sluggish
  – Added a 2nd Quill database server just for job information

What works, but is painful


• If your daemons are logging a lot to sql.log files, but not writing to the database..
  – Database down, etc.
  – Your database is in a world of hurt while it tries to catch up..

What works, but is painful


• Many Postgres tuning guides recommend a connection pooler if you need scads of connections
  – pgpool-II
  – PgBouncer

• Tried both; Quill doesn’t seem to like it
  – It *did* reduce load….
  – But it often locked up the database (“idle in transaction”), and we didn’t get anywhere

What Hasn’t Worked
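For the record, the pooler setups were of roughly this shape — a sketch, not our exact configuration; the host and pool sizes are illustrative. One plausible culprit: transaction-level pooling assumes clients don’t hold transactions open, which fits the “idle in transaction” lockups we saw:

```
; pgbouncer.ini (illustrative)
[databases]
quill = host=127.0.0.1 port=5432 dbname=quill

[pgbouncer]
listen_port       = 6432
pool_mode         = transaction   ; session pooling may suit Quill better
max_client_conn   = 2500
default_pool_size = 50
```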


• Throw hardware at the database!
  – Spindle count seems OK
    • Not I/O bound (any more)
  – More memory = more connections
    • 16GB? More?
  – More, faster CPUs
    • We appear to be CPU-bound now
    • Get the latest multi-cores

What can we do about it?


• Contact Wisconsin and call for rescue

What can we do about it?

“Hey guys.. This is really hard on the old database”

“Hmm. Let’s take a look.”


• Todd, Greg, and probably others take a look:
  – Quill always hits the database, even for unchanged ads
  – The Postgres backend does not prepare SQL queries before submitting them

• Being fixed - Todd is optimistic
  – We’ll report the results as soon as we have them

What can Wisconsin do about it?
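For readers unfamiliar with the second item: “preparing” a query lets Postgres parse and plan the statement once, then execute it many times with different parameters. A hypothetical illustration, not the actual statements Quill issues:

```sql
-- Plan once...
PREPARE job_by_schedd (text) AS
  SELECT * FROM jobs_horizontal_history WHERE scheddname = $1;

-- ...then execute many times with different parameters
EXECUTE job_by_schedd('epsilon.bio.purdue.edu');
```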


• Reporting for users
  – Easy access to statistics about who ran on “my” machines
    • Mashups, web portals
  – Diagnostic tools to help users
    • Troubleshooting, etc.

Future Directions


• Questions?

The End


Backup slides


BoilerGrid - Results

Year   Pool Size   Jobs        Hours Delivered   Unique Users
2004   1500        43,551      346,000           14
2005   4000        210,717     1,695,000         26
2006   6100        4,251,981   5,527,000         72
2007   7700        9,611,813   9,524,000         117
2008   14000+      ?           ?                 63 so far..
