Implementing a Central Quill Database in a Large Condor Installation Preston Smith psmith@purdue.edu Condor Week 2008 - April 30, 2008
Slide 1
Implementing a Central Quill Database in a Large Condor Installation
Preston Smith, psmith@purdue.edu
Condor Week 2008 - April 30, 2008
Slide 2
Overview
Background, BoilerGrid, motivation, what works well, what has been challenging, what just doesn't work, and future directions.
Slide 3
BoilerGrid
The Purdue Condor Grid (BoilerGrid) is comprised of Linux HPC clusters, student labs, machines from academic departments, and Purdue regional campuses: 8,900 batch slots today, 14,000 batch slots in a few weeks. In 2007 it delivered over 10 million CPU-hours of high-throughput science to Purdue and the national community through the Open Science Grid and TeraGrid.
Slide 4
BoilerGrid - Growth
Slide 5
BoilerGrid - Results
Slide 6
A Central Quill Database
As of Condor 6.9.4, Quill can store information about all the execute machines and daemons in a pool; Quill is now able to store job history and queue contents in a single, central database. Since December 2007, we've been working to store the state of BoilerGrid in a Quill installation.
Slide 7
Motivation
Why would we want to do such a thing? Research into the state of a large distributed system: several researchers at Purdue, plus collaborators at Notre Dame, are interested in failure analysis/prediction, smart scheduling, and interesting reporting for machine owners. Is the events table useful for user troubleshooting? And one of our familiar gripes: usage reporting. Structural biologists (see earlier today) like to submit jobs from their desks, too. How can we access that job history to complete the picture of BoilerGrid's usage?
Slide 8
The Quill Server
Dell 2850: 2x 2.8 GHz Xeons (hyperthreaded), 5 GB RAM, Postgres on a 4-disk Ultra320 SCSI RAID-0.
What works, but is painful
Thousands of hosts pounding a Postgres database is non-trivial. Be sure to turn down the Quill polling rate (QUILL_POLLING_PERIOD): the default is 10 s, and we went to 1 hour on execute machines (see the config sketch after the top snapshot below). At some level, this is an exercise in tuning your Postgres server. Quick diversion into Postgres tuning 101..

top - 13:45:30 up 23 days, 19:59, 2 users, load average: 563.79, 471.50, 428.
Tasks: 804 total, 670 running, 131 sleeping, 3 stopped, 0 zombie
Cpu(s): 94.6% us, 2.9% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.4% hi, 2.2% si
Mem: 5079368k total, 5042452k used, 36916k free, 10820k buffers
Swap: 4016200k total, 68292k used, 3947908k free, 2857076k cached
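A minimal condor_config sketch of the polling change described above; QUILL_ENABLED and QUILL_POLLING_PERIOD are the standard Quill knobs, and the one-hour value mirrors what we run on the execute machines:

QUILL_ENABLED = TRUE
# Poll far less often than the 10-second default so thousands of startds
# do not hammer the central Postgres server (value is in seconds)
QUILL_POLLING_PERIOD = 3600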
Slide 11
Postgres
That's assuming there's enough disk bandwidth. In order to support 2,500 simultaneous connections, one must turn up max_connections, and if you turn up max_connections, you need ~400 bytes of shared memory per slot. Currently we have 2 GB of shared memory allocated.
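A rough postgresql.conf sketch of those connection settings; the numbers are from the slide, but the kernel sysctl line is only an assumption about a typical Linux setup:

# postgresql.conf
max_connections = 2500        # one slot per potential Quill connection,
                              # ~400 bytes of shared memory each

# /etc/sysctl.conf (Linux): allow a shared-memory segment of about 2 GB
kernel.shmmax = 2147483648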
Slide 12
Postgres
Then you'll need to turn up shared_buffers (1 GB currently). Don't forget fsm_pages, or you'll see:

WARNING: relation "public.machines_vertical_history" contains more than "max_fsm_pages" pages with useful free space
HINT: Consider compacting this relation or increasing the configuration parameter "max_fsm_pages".
Slide 13
What works, but is painful
So by now we can withstand the worker nodes reasonably well. Add the schedds: condor_history returns history from ALL schedds (bug fixed in 7.0.2). The execute machines create enough load that condor_q is sluggish, so we added a 2nd Quill database server just for job information (a configuration sketch follows below).
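A hypothetical condor_config fragment for the submit machines, pointing Quill at that separate job-only database; the macro names are the standard Quill database settings, but the host and database names here are made up for illustration:

QUILL_ENABLED = TRUE
# Second Postgres server, dedicated to schedd/job data (hypothetical names)
QUILL_DB_IP_ADDR = quill-jobs.example.edu:5432
QUILL_DB_NAME = quill_jobs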
Slide 14
What works, but is painful
If your daemons log a lot to their sql.log files but aren't writing to the database (database down, etc.), your database is in a world of hurt while it tries to catch up..
Slide 15
What Hasn't Worked
Many Postgres tuning guides recommend a connection pooler if you need scads of connections: pgpool-II and PgBouncer. We tried both; Quill doesn't seem to like them. They *did* reduce the load, but they often locked up the database (connections stuck "idle in transaction"), and we didn't get anywhere.
Slide 16
What can we do about it?
Throw hardware at the database! Spindle count seems OK; we're not I/O bound (any more). More memory = more connections: 16 GB? More? More, faster CPUs: we appear to be CPU-bound now, so get the latest multi-cores.
Slide 17
What can we do about it?
Contact Wisconsin and call for rescue: "Hey guys.. this is really hard on the old database." "Hmm. Let's take a look."
Slide 18
What can Wisconsin do about it?
Todd, Greg, and probably others take a look: Quill always hits the database, even for unchanged ads, and the Postgres backend does not prepare SQL queries before submitting them. This is being fixed; Todd is optimistic. We'll report the results as soon as we have them.
Slide 19
Future Directions
Reporting for users: easy access to statistics about "who ran on my machines." Mashups, web portals. Diagnostic tools to help users: troubleshooting, etc.